SLURM HPC Cromwell implementation of GATK4 germline variant calling pipeline
See the GATK website for more information on this toolset
Assumptions
- Using hg38 human reference genome build
- Running using HPC/SLURM scheduling. This repo was specifically tested on Pawsey Zeus machine, primarily running in the
/scratch
partition. - Starting from short-read Illumina paired-end fastq files as input
Dependencies
The following versions have been tested and work, but GATK and Cromwell are regularly updated and so one must consider whether they would like to use newer versions of these tools.
- BWA/0.7.15
- GATK v4.0.6.0
- SAMtools/1.5
- picard/2.9
- Python/2.7
- Cromwell v61
Quick start guide
Installing and preparing environment for GATK4 with Cromwell
- Clone repository
git clone https://github.com/SarahBeecroft/slurmCromwellGATK4.git
cd slurmCromwellGATK4
chmod +x *.sh
-
Install Miniconda if you haven’t already. This is best placed in your
/group
directory to avoid filling your small/home
directory, or being purged is placed in the/scratch
directory. -
Create Conda environment using the supplied conda environment file
conda env create --file gatk4_pipeline.yml
-
Download the necessary .jar files
- The Cromwell workfow orchestration engine can be downloaded from https://github.com/broadinstitute/cromwell/releases/
- GATK can be downloaded from https://github.com/broadinstitute/gatk/releases. Unzip the file with
unzip
- Picard can be downloaded from https://github.com/broadinstitute/picard/releases/
-
If you do not have the resource bundle files already, these need to be downloaded. In future they will be cached on Pawsey systems. The bundle data should be download from the Google Cloud bucket and not from the FTP site, which is missing various files. Refer to this handy blog post on how to download the resource files using Google Cloud SDK. There is a Slurm script (download_bundle.slurm) that can be used to download all hg38 files from the Google Cloud bucket. The files were downloaded in /scratch/pawsey0001/sbeecroft/hg38/v0, which needs to be moved before the data becomes purged after 30 days. Note that Homo_sapiens_assembly38.dbsnp138.vcf.gz was from the FTP bundle as this file could not be downloaded using the Conda version of Google Cloud SDK.
Note that the hg38_wgs_scattered_calling_intervals.txt
will need to be to generated using the following:
cd
find `pwd` -name "scattered.interval_list" -print | sort > hg38_wgs_scattered_calling_intervals.txt
These files are required for Multisample_Fastq_to_Gvcf_GATK4.
Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta
Homo_sapiens_assembly38.fasta.fai
Homo_sapiens_assembly38.fasta.64.alt
Homo_sapiens_assembly38.fasta.64.amb
Homo_sapiens_assembly38.fasta.64.ann
Homo_sapiens_assembly38.fasta.64.bwt
Homo_sapiens_assembly38.fasta.64.pac
Homo_sapiens_assembly38.fasta.64.sa
Homo_sapiens_assembly38.fasta.amb
Homo_sapiens_assembly38.fasta.ann
Homo_sapiens_assembly38.fasta.bwt
Homo_sapiens_assembly38.fasta.pac
Homo_sapiens_assembly38.fasta.sa
Homo_sapiens_assembly38.dbsnp138.vcf.gz (needs to be gunzipped)
Homo_sapiens_assembly38.dbsnp138.vcf.idx
Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
Homo_sapiens_assembly38.dbsnp138.vcf
Homo_sapiens_assembly38.dbsnp138.vcf.idx
Homo_sapiens_assembly38.known_indels.vcf.gz
Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
These files are required for Multisample_jointgt_GATK4.
wgs_evaluation_regions.hg38.interval_list
hg38.custom_100Mb.intervals
Homo_sapiens_assembly38.dbsnp138.vcf
Homo_sapiens_assembly38.dbsnp138.vcf.idx
1000G_phase1.snps.high_confidence.hg38.vcf.gz
1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
1000G_omni2.5.hg38.vcf.gz
1000G_omni2.5.hg38.vcf.gz.tbi
Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz
Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz.tbi
hapmap_3.3.hg38.vcf.gz
hapmap_3.3.hg38.vcf.gz.tbi
-
Set up the config files. Files that you need to edit with the correct paths to your data/jar files or other specific configurations are:
Multisample_Fastq_to_Gvcf_GATK4_inputs_hg38.json
Multisample_jointgt_GATK4_inputs_hg38.json
- both json files will need the correct paths to your reference file locations, and the file specifying your inputs i.e.
samples.txt
orgvcfs.txt
- both json files will need the correct paths to your reference file locations, and the file specifying your inputs i.e.
samples.txt
gvcfs.txt
- These are the sample input files (tab seperated)
- The format for samples.txt is sampleID, sampleID_readgroup, path_to_fastq_R1_file, path_to_fastq_R2_file,
- The format for gvcfs.txt is sample ID, gvcf, gvcf .tbi index file
- Examples are included in this repo
- NOTE: Having tabs, not spaces, is vital for parsing the file. Visual studio code tends to introduce spaces, so if you are having issues, check the file with another text editor such as sublime.
launch_cromwell.sh
launch_jointgt.sh
- These are the scripts which launch the pipeline.
launch_cromwell.sh
launches the fastq to gvcf stagelaunch_jointgt.sh
launched the gvcf joint genotyping to cohort vcf step. This is perfomed when you have run all samples through the fastq to gvcf stage.- Check the paths and parameters make sense for your machine
slurm.conf
- the main options here relate to the job scheduler. If you are running on Zeus at Pawsey, you should not need to alter these parameters.
cromwell.options
cromwell.options
requires editing to provide the directory where you would like the final workflow outputs to be written
Multisample_Fastq_to_Gvcf_GATK4.wdl
ruddle_fastq_to_gvcf_single_sample_gatk4.wdl
- The paths to your jar files will need to be updated
- The path to your conda
activate
binary will need to be updated (e.g./group/projectID/userID/miniconda/bin/activate
)
-
Launch the job using
sbatch launch_cromwell.sh
. When that has completed successfully, you can launch the second stage of the pipeline (joint calling) withsbatch launch_jointgt.sh
.
Overview of the steps in Multisample_Fastq_to_Gvcf_GATK4.wdl
This part of the pipeline takes short-read, Illumina paired-end fastq files as the input. The outputs generated are sorted, duplicate marked bam files and their indices, duplicate metric information, and a GVCF file for each sample. The GVCF files are used as input for the second part of the pipeline (joint genotyping).
FastqToUbam
GetBwaVersion
SamToFastqAndBwaMem
MergeBamAlignment
SortAndFixTags
MarkDuplicates
CreateSequenceGroupingTSV
BaseRecalibrator
GatherBqsrReports
ApplyBQSR
GatherBamFiles
HaplotypeCaller
MergeGVCFs
Overview of the steps in Multisample_jointgt_GATK4.wdl
This part of the pipeline takes GVCF files (one per sample), and performs joint genotyping across all of the provided samples. This means that old previously generated GVCFs can be joint-called with new GVCFs whenever you need to add new samples. The key output from this is a joint-genotyped, cohort-wide VCF file.
GetNumberOfSamples
ImportGVCFs
GenotypeGVCFs
HardFilterAndMakeSitesOnlyVcf
IndelsVariantRecalibrator
SNPsVariantRecalibratorCreateModel
SNPsVariantRecalibrator
GatherTranches
ApplyRecalibration
GatherVcfs
CollectVariantCallingMetrics
GatherMetrics
DynamicallyCombineIntervals
Version History
Version 1 (earliest) Created 17th Aug 2021 at 05:42 by Sarah Beecroft
Added/updated 1 files
Open
master
4de7ef6
Creators
Not specifiedSubmitter
Views: 4269 Downloads: 309
Created: 17th Aug 2021 at 05:42
Last updated: 17th Aug 2021 at 05:56
None