Variant Calling Workflow tutorial
This guide provides a flexible workflow for analyzing next-generation sequencing (NGS) data using Galaxy. The tutorial walks you through each step of the process, from setting up your Galaxy environment to interpreting results, ensuring a comprehensive approach to variant analysis. A practice exercise is available to reinforce learning and provide hands-on experience for your project.
Practice Example
- Download the sample dataset from Figshare: Click to Download Dataset
- Access the example assessment form at: Practice Exercise
- Follow the steps provided in this tutorial to analyze a sample dataset.
- Answer the questions based on your findings and submit your responses.
Tutorial Sections
- General Prerequisites
- 1. Set Up Your Galaxy Environment
- 2. Understand Your Sequencing Data
- 3. Align Reads to a Reference Genome
- 4. Call Variants
- 5. Annotate Variants
- 6. Visualize Variants
- 7. Analyze and Interpret Results
General Prerequisites
-
Objective: Analyze NGS data to identify and interpret variants for
[SPECIFIC_PURPOSE, e.g., clinical reporting, research discovery]
. -
Input Data:
- Paired-end FASTQ files:
[SAMPLE_NAME]_R1.fq
,[SAMPLE_NAME]_R2.fq
- BED file:
[TARGET_REGIONS].bed
(defines regions of interest)
- Paired-end FASTQ files:
- Tools Needed:
- Hardware: Computer with 100MB–1GB storage, large or dual-monitor setup recommended.
Set Up Your Galaxy Environment
Goal: Establish a workspace for data analysis.
Steps:
- Visit https://usegalaxy.org/ and click “Login or Register” to create an account (use Firefox, Chrome, or Safari).
- Upload or request your instructor/collaborator to share the dataset (e.g.,
[SAMPLE_NAME]_R1.fq
,[SAMPLE_NAME]_R2.fq
,[TARGET_REGIONS].bed
) within Galaxy. - Access shared data:
- “History” panel (right) > “Histories Shared with Me”
- Select shared history > “View” > “Import History” > “Start Using This History”
Customization: Replace [SAMPLE_NAME]
(e.g., TumorSample1
) and [TARGET_REGIONS]
(e.g., CancerPanel.bed
).
Understand Your Sequencing Data
Goal: Assess the quality and structure of raw sequencing data.
Steps:
-
View FASTQ Data:
- In “History,” click the “eye” icon next to
[SAMPLE_NAME]_R1.fq
. - Examine FASTQ format:
-
- Record run number (e.g., 585).
- Interpret quality scores using ASCII Phred table (e.g., “5” = 1% error, “B” = 0.05% error).
- Record run number (e.g., 585).
- In “History,” click the “eye” icon next to
-
Run Quality Control:
- Tools > “FASTQ Quality Control” > “FastQC Read Quality Reports”
- Input:
[SAMPLE_NAME]_R1.fq
> “Run Tool.” Repeat for[SAMPLE_NAME]_R2.fq
. - View “webpage” output and record:
- Total sequences
- Sequences flagged as poor quality
- Sequence length
- %GC content
- Assess quality metrics for concerns (e.g., low base quality).
Customization: Adjust quality thresholds based on project needs.
Align Reads to a Reference Genome
Goal: Map sequencing reads to a reference genome.
Steps:
-
Align with BWA-MEM:
- Tools > “Mapping” > “Map with BWA-MEM” (Version 0.7.17.1 or latest)
- Settings:
- Reference genome:
[REFERENCE_GENOME]
(e.g.,Human hg19
) - Paired-end: Yes
- FASTQ files:
[SAMPLE_NAME]_R1.fq
,[SAMPLE_NAME]_R2.fq
- Run Tool (5–15 min). Output:
[SAMPLE_NAME].bam
- Note size difference (e.g., 600MB vs. 1GB FASTQ due to compression).
-
Sort the BAM File:
- Tools > “Genomics ToolKits” > “Picard” > “SortSam” (Version 2.18.2.1)
- Input:
[SAMPLE_NAME].bam
- Sort order: “Coordinate”
- Validation stringency: “Lenient”
- Run Tool
-
Mark Duplicates:
- Tools > “Picard” > “MarkDuplicates” (Version 2.18.2.1)
- Input: Sorted
[SAMPLE_NAME].bam
- Run Tool. Outputs: Marked BAM, metrics file
- View metrics (e.g., % PCR duplicates).
-
Remove Duplicates:
- Tools > “SAM/BAM” > “RmDup” (Version 2.0.1)
- Input: Original
[SAMPLE_NAME].bam
- Paired-end: Yes
- Run Tool. Compare sizes (e.g., % reduction vs. % duplicates).
Alternative Approach: Running Without Removing Duplicates
While removing duplicates is a common practice to reduce PCR artifacts, in some cases, it might be beneficial to retain duplicates:
- Low Input DNA: If the sample has a low DNA input, duplicate removal may significantly reduce coverage, potentially affecting variant detection.
- Unique Molecular Identifiers (UMIs): If UMIs were used, duplicates can be tracked and distinguished, making removal unnecessary.
- Somatic Variant Calling: Retaining duplicates may help in detecting low-frequency somatic mutations in cancer studies.
- Amplicon-Based Sequencing: In targeted sequencing approaches (e.g., for viral or bacterial genomes), duplicates may represent true coverage rather than PCR artifacts.
To run the pipeline without removing duplicates, simply skip the “Remove Duplicates” step and proceed with variant calling as usual.
Call Variants
Goal: Identify genetic variants from aligned reads.
Steps:
-
VarScan Workflow:
- Tools > “SAM/BAM” > “Samtools mpileup” (Version 2.1.4)
- Input: Duplicate-removed
[SAMPLE_NAME].bam
- Reference:
[REFERENCE_GENOME]
- Run Tool → Output: Pileup file
- Tools > “Variant Calling” > “VarScan” (Version 2.4.2)
- Input: Pileup file
- Settings:
- Min read depth: 20
- Min supporting reads: 4
- Min base quality: 20
- Min variant frequency: 0.03
- Ignore >90% strand bias: Yes
- Run Tool → Output:
[SAMPLE_NAME]_VarScan.vcf
-
FreeBayes Workflow:
- Tools > “Variant Calling” > “FreeBayes” (Version 1.1.0.46-0)
- Input: Duplicate-removed
[SAMPLE_NAME].bam
- Reference:
[REFERENCE_GENOME]
- Parameters: Simple diploid calling, Min coverage: 10
- Run Tool → Output:
[SAMPLE_NAME]_FreeBayes.vcf
-
Limit to Target Regions:
- Tools > “VCF/BCF” > “VCF-BEDintersect” (Version 1.0.0)
- Inputs: Each VCF +
[TARGET_REGIONS].bed
- Run Tool (twice). Outputs: Filtered VCFs
-
Compare Variant Callers:
- Click VCF files in History to view line counts.
- Record pre/post-BED intersect counts (e.g., VarScan: 9,535 → 48).
Customization: Add tools (e.g., GATK) or adjust thresholds.
Annotate Variants
Goal: Add functional and clinical context to variants.
Steps:
-
wANNOVAR Annotation:
- Download filtered VCFs (e.g.,
[SAMPLE_NAME]_VarScan.bed.vcf
) via “disk” icon. - Visit http://wannovar.wglab.org/, upload VCF (max 2 runs, use
[YOUR_EMAIL]
). - Download “genome” CSV.
- In Excel:
- Sort by
gnomad211_exome_AF
- Filter
ExonicFunc.refGeneWithVer
(exclude “Synonymous SNV”) - Filter
Func.refGeneWithVer
(exclude “intron”, “UTR3”)
- Download filtered VCFs (e.g.,
-
VEP Annotation:
- Visit http://grch37.ensembl.org/Homo_sapiens/Tools/VEP
- Input: Paste variant (e.g.,
chr7 55221822 55221822 C T
) or upload VCF - Run and review annotations.
-
Compare Results:
- Identify shared and unique variants (e.g., TP53 hits).
Customization: Use SnpEff or filter by ClinVar.
Visualize Variants
Goal: Inspect alignments and variants visually.
Steps:
-
Slice BAM File:
- Tools > “SAM/BAM” > “Slice BAM by genomic regions”
- Input: Original
[SAMPLE_NAME].bam
+[TARGET_REGIONS].bed
- Run Tool → Download
.bam
and.bai
-
View in IGV:
- Install IGV, load
[REFERENCE_GENOME]
(e.g.,hg19
) - Load
.bam
and.bai
(same name/folder) - Zoom to variant (e.g.,
7:140434574
) - Sort reads by base (right-click > “Group by base”)
- Check for artifacts (e.g., strand bias).
- Install IGV, load
Analyze and Interpret Results
Goal: Generate insights for [SPECIFIC_PURPOSE]
.
Steps:
- Apply pipeline to your dataset (e.g.,
[YOUR_SAMPLE]_R1.fq
). - Identify key variants (e.g., somatic, actionable).
- Compare with external data (e.g., COSMIC, clinical reports).
- Document in a report/presentation.
Customization: Focus on your goal (e.g., cancer, rare disease).
For further assistance, adapt this pipeline and contact Felicia Gomez!