The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
Software Category: bio
For detailed information, visit the GATK website.
For a GitHub reference, visit: https://github.com/broadinstitute/gatk
The current installation of GATK incorporates the most popular packages. To find the available versions and learn how to load them, run:
module spider gatk
The output of the command shows the available GATK module versions.
For detailed information about a particular GATK module, including how to load the module, run the module spider command with the module's full version label. For example:
module spider gatk/126.96.36.199
|Module||Version||Module Load Command|
|gatk||188.8.131.52||module load gatk/184.108.40.206|
|gatk||220.127.116.11||module load gatk/18.104.22.168|
|gatk||22.214.171.124||module load gatk/126.96.36.199|
Note: Invoke GATK through the gatk wrapper script rather than calling the jar directly. The wrapper selects the appropriate jar file (there are two!) and sets some parameters for you.
For help on using gatk itself, run
gatk --help
To print a list of available tools, run
gatk --list
To print help for a particular tool, run
gatk ToolName --help
To run a GATK tool locally, the syntax is:
gatk ToolName toolArguments
Basic Usage Examples
Below are a few simple examples of using GATK4 tools in single-core mode.
PrintReads is a generic utility tool for manipulating sequencing data in SAM/BAM format.
To print all reads that have a mapping quality above zero from 2 input BAMs (say, input1.bam and input2.bam) and write the output to output.bam:
gatk PrintReads \
    -I input1.bam \
    -I input2.bam \
    -O output.bam \
    --read-filter MappingQualityNotZeroReadFilter
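When a run involves more than a couple of BAM files, typing one -I per file becomes error-prone. The following is a minimal sketch, assuming bash, of assembling the repeated -I arguments from an array. The file names are the placeholders used above, and the command is printed rather than executed.

```shell
#!/bin/bash
# Sketch only: file names are placeholders, and the command is echoed, not run.
bams=(input1.bam input2.bam)

# Collect one "-I <file>" pair per BAM.
input_args=()
for bam in "${bams[@]}"; do
    input_args+=(-I "$bam")
done

# Assemble the full command as an array so file names with spaces stay intact.
cmd=(gatk PrintReads "${input_args[@]}" -O output.bam)

# Dry run: show what would be executed.
echo "${cmd[*]}"

# To actually run it, replace the echo with:
#   "${cmd[@]}"
```

Keeping the command in an array rather than a string avoids word-splitting surprises when it is finally executed.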
The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region.
Basic syntax for variant-only calling on DNA-seq data:
gatk --java-options "-Xmx4g" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam [-I sample2.bam ...] \
    [--dbsnp dbSNP.vcf] \
    [-stand-call-conf 30] \
    [-L targets.interval_list] \
    -O output.raw.snps.indels.vcf
Note: Here, we are setting the maximum Java heap size to 4 GB. The appropriate value depends on the volume of data at hand.
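Inside a SLURM job, the heap size can be derived from the memory that was requested instead of being hard-coded. The sketch below assumes bash and relies on SLURM_MEM_PER_NODE, which SLURM sets (in MB) when the job requests memory with --mem. The helper name xmx_from_mem_mb and the 80% ratio are illustrative assumptions, not GATK recommendations.

```shell
#!/bin/bash
# Hypothetical helper: convert a memory allocation in MB into a -Xmx option,
# keeping 20% back for non-heap JVM overhead (assumed ratio).
xmx_from_mem_mb() {
    local mem_mb=$1
    printf -- '-Xmx%dm\n' "$(( mem_mb * 80 / 100 ))"
}

# Use the job's allocation when available; fall back to 4096 MB outside a job.
java_opts=$(xmx_from_mem_mb "${SLURM_MEM_PER_NODE:-4096}")

# Dry run: print the resulting invocation pattern instead of running it.
echo "gatk --java-options \"$java_opts\" ToolName toolArguments"
```

Leaving some headroom below the job's memory limit reduces the chance that the scheduler terminates the job when the JVM's total footprint exceeds its heap.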
Note: If you are working with the human reference genome, please refer to the local genome repository on Rivanna at /project/genomes/Homo_sapiens/ for the reference.fasta, and to the corresponding GATK data bundle at /project/genomes/Homo_sapiens/GATK_bundle/ for resource files like the 1000G. There is no need to download them to your working directory.
For example, to run HaplotypeCaller on reference-aligned BAMs for 3 samples (say, sample1-hg38.bam, sample2-hg38.bam and sample3-hg38.bam), accessing files from the Rivanna genomes repository:
gatk --java-options "-Xmx4g" HaplotypeCaller \
    -R /project/genomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa \
    -I sample1-hg38.bam \
    -I sample2-hg38.bam \
    -I sample3-hg38.bam \
    --dbsnp /project/genomes/Homo_sapiens/GATK_bundle/hg38/dbsnp_146.hg38.vcf.gz \
    -stand-call-conf 30 \
    -O output.raw.snps.indels.vcf
The output will be written to the file output.raw.snps.indels.vcf, in the Variant Call Format.
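A quick sanity check on the result is to count the variant records. In a VCF file, header lines start with "#" and every other line is one record. The sketch below assumes bash and an uncompressed VCF; the helper name count_variants is hypothetical, and the demo file holds made-up records purely for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count non-header lines (variant records) in an uncompressed VCF.
count_variants() {
    grep -vc '^#' "$1"
}

# Toy VCF with two invented records, written to a temporary file.
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_vcf"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n' >> "$demo_vcf"
printf 'chr1\t200\t.\tC\tT\t60\tPASS\t.\n' >> "$demo_vcf"

count_variants "$demo_vcf"   # prints 2

rm -f "$demo_vcf"
```

On real output, the call would be count_variants output.raw.snps.indels.vcf.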
Parallelism in GATK4
The concepts involved and their application within GATK are well explained in this article.
- In GATK3, there were two options for tools that supported multi-threading, controlled by the arguments -nt and -nct.
- In GATK4, tools take advantage of Apache Spark, an open-source, industry-standard software library.
Spark-enabled GATK tools
Not all GATK tools use Spark. Check the respective Tool Doc to confirm whether a tool supports Spark.
Briefly, Spark is a piece of software that GATK4 uses to do multithreading, which is a form of parallelization that allows a computer (or cluster of computers) to finish executing a task sooner. You can read more about multithreading and parallelism in GATK here.
The “sparkified” versions have the suffix “Spark” at the end of their names. Many of these are still experimental; please check carefully for the expected output, and validate it against the non-Spark tools.
You DO NOT need a Spark cluster to run Spark-enabled GATK tools!
When running on a Rivanna compute node (with multiple CPU cores), the GATK engine can use Spark to create a virtual standalone cluster in place for its multi-threaded processing.
“local”-Spark Usage Example:
The PrintReads tool we explored above has a Spark version called PrintReadsSpark. To set up a local Spark environment and run the same job using 8 threads, we can use the --spark-master argument:
gatk PrintReadsSpark \
    --spark-master 'local[8]' \
    -I input1.bam \
    -I input2.bam \
    -O output.bam \
    --read-filter MappingQualityNotZeroReadFilter
Note: Make sure to request 8 CPU cores before executing the above command, either by starting an interactive session using ijob or by submitting the job via a SLURM batch submission script.
Below is an example batch submission script, gatk-printReadsSpark.slurm.sh, for the above job.
#!/bin/bash
#SBATCH --job-name=gatk-prs         # Job name
#SBATCH --nodes=1                   # Number of nodes
#SBATCH --cpus-per-task=8           # Number of CPU cores per task
#SBATCH --mem=10gb                  # Job memory
#SBATCH --time=05:00:00             # Time limit hrs:min:sec
#SBATCH --output=gatk-prs_%A.out    # Standard output log
#SBATCH --error=gatk-prs_%A.err     # Standard error log
#SBATCH -A <YOUR_ALLOCATION>        # Allocation name
#SBATCH -p standard                 # SLURM queue

pwd; hostname; date

# Load the gatk module to make the wrapper script available for execution
module load gatk

# gatk command and arguments
gatk --java-options "-Xmx8G" PrintReadsSpark \
    --spark-master 'local[8]' \
    -I input1.bam \
    -I input2.bam \
    -O output.bam \
    --read-filter MappingQualityNotZeroReadFilter

date
Replace <YOUR_ALLOCATION> with your allocation group.
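Spark's local master accepts a thread count in the form local[N], so the count can follow the cores SLURM actually granted rather than being typed twice. The sketch below assumes bash and uses SLURM_CPUS_PER_TASK, which SLURM sets when the job specifies --cpus-per-task. The helper name spark_master_for is hypothetical, and the command is printed rather than executed.

```shell
#!/bin/bash
# Hypothetical helper: build a local Spark master string for a given thread count.
spark_master_for() {
    local threads=${1:-1}
    printf 'local[%s]\n' "$threads"
}

# Follow the job's core count; fall back to a single thread outside a SLURM job.
spark_master=$(spark_master_for "${SLURM_CPUS_PER_TASK:-1}")

# Dry run: print the option as it would be passed to a Spark-enabled tool.
# Quoting the value keeps the shell from treating [..] as a glob pattern.
echo "gatk PrintReadsSpark --spark-master '$spark_master'"
```

With --cpus-per-task=8, as in the script above, this would yield local[8].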
To submit the job:
sbatch gatk-printReadsSpark.slurm.sh
To monitor the progress of the job:
jobq
or
squeue -u <mst3k> # replace <mst3k> with your computing ID.
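To follow one specific job instead of everything under your ID, the job number can be taken from the confirmation line that sbatch prints on submission ("Submitted batch job" followed by the ID). The sketch below assumes bash; the helper name job_id_from_sbatch is hypothetical, and the sample message is an invented example.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from sbatch's confirmation message.
job_id_from_sbatch() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Invented sample message, standing in for real sbatch output.
sample_message="Submitted batch job 12345"
job_id=$(job_id_from_sbatch "$sample_message")
echo "$job_id"

# In a real session this could become:
#   job_id=$(job_id_from_sbatch "$(sbatch gatk-printReadsSpark.slurm.sh)")
#   squeue -j "$job_id"
```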