The framework used for the 1000 Genomes Project: recalibrate, analyze, compare, ...
The Genome Analysis Toolkit, or GATK ([1]), is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping and a strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size. GATK uses another Broad utility, Queue, to run full analysis workflows unsupervised by applying a predefined sequence of tools.
After registering on the Broad site, you can access a variety of reference data and tutorials at ([2]).
GATK manual page
The Genome Analysis Toolkit (GATK) v2.4-9-g532efad, Compiled 2013/03/19 07:35:36
Copyright (c) 2010 The Broad Institute
For support and documentation go to
usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-args <arg_file>] [-I <input_file>] [-rbs <read_buffer_size>] [-et
<phone_home>] [-K <gatk_key>] [-tag <tag>] [-rf <read_filter>] [-L <intervals>] [-XL <excludeIntervals>] [-isr
<interval_set_rule>] [-im <interval_merging>] [-ip <interval_padding>] [-R <reference_sequence>] [-ndrs]
[--disableRandomization] [-maxRuntime <maxRuntime>] [-maxRuntimeUnits <maxRuntimeUnits>] [-dt <downsampling_type>]
[-dfrac <downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-baq <baq>] [-baqGOP <baqGapOpenPenalty>]
[-fixMisencodedQuals] [-allowPotentiallyMisencodedQuals] [-PF <performanceLog>] [-OQ] [-BQSR <BQSR>] [-DIQ] [-EOQ]
[-preserveQ <preserve_qscores_less_than>] [-globalQScorePrior <globalQScorePrior>] [-allowBqsrOnReducedBams] [-DBQ
<defaultBaseQualities>] [-S <validation_strictness>] [-rpr] [-kpr] [-U <unsafe>] [-nt <num_threads>] [-nct
<num_cpu_threads_per_data_thread>] [-mte] [-bfh <num_bam_file_handles>] [-rgbl <read_group_black_list>] [-ped
<pedigree>] [-pedString <pedigreeString>] [-pedValidationType <pedigreeValidationType>] [-l <logging_level>] [-log
<log_to_file>] [-h] [-version]
-T,--analysis_type <analysis_type> Type of analysis to run
-args,--arg_file <arg_file> Reads arguments from the specified
-I,--input_file <input_file> SAM or BAM file(s)
-rbs,--read_buffer_size <read_buffer_size> Number of reads per SAM file to buffer
in memory
-et,--phone_home <phone_home> What kind of GATK run report should we
generate? STANDARD is the default, can
be NO_ET so nothing is posted to the
run repository. Please see
for details. (NO_ET|STANDARD|AWS|
-K,--gatk_key <gatk_key> GATK Key file. Required if running
with -et NO_ET. Please see
for details.
-tag,--tag <tag> Arbitrary tag string to identify this
GATK run as part of a group of runs,
for later analysis
-rf,--read_filter <read_filter> Specify filtration criteria to apply
to each read individually
-L,--intervals <intervals> One or more genomic intervals over
which to operate. Can be explicitly
specified on the command line or in a
file (including a rod file)
-XL,--excludeIntervals <excludeIntervals> One or more genomic intervals to
exclude from processing. Can be
explicitly specified on the command
line or in a file (including a rod
-isr,--interval_set_rule <interval_set_rule> Indicates the set merging approach the
interval parser should use to combine
the various -L or -XL inputs (UNION|
-im,--interval_merging <interval_merging> Indicates the interval merging rule we
should use for abutting intervals (ALL|
-ip,--interval_padding <interval_padding> Indicates how many basepairs of
padding to include around each of the
intervals specified with the
-L/--intervals argument
-R,--reference_sequence <reference_sequence> Reference sequence file
-ndrs,--nonDeterministicRandomSeed Makes the GATK behave non
deterministically, that is, the random
numbers generated will be different in
every run
--disableRandomization Completely eliminates randomization
from nondeterministic methods. To be
used mostly in the testing framework
where dynamic parallelism can result
in differing numbers of calls to the
-maxRuntime,--maxRuntime <maxRuntime> If provided, that GATK will stop
execution cleanly as soon after
maxRuntime has been exceeded,
truncating the run but not exiting
with a failure. By default the value
is interpreted in minutes, but this
can be changed by maxRuntimeUnits
-maxRuntimeUnits,--maxRuntimeUnits <maxRuntimeUnits> The TimeUnit for maxRuntime
-dt,--downsampling_type <downsampling_type> Type of reads downsampling to employ
at a given locus. Reads will be
selected randomly to be removed from
the pile based on the method described
-dfrac,--downsample_to_fraction <downsample_to_fraction> Fraction [0.0-1.0] of reads to
downsample to
-dcov,--downsample_to_coverage <downsample_to_coverage> Coverage [integer] to downsample to at
any given locus; note that downsampled
reads are randomly selected from all
possible reads at a locus. For
non-locus-based traversals (eg.,
ReadWalkers), this sets the maximum
number of reads at each alignment
start position.
-baq,--baq <baq> Type of BAQ calculation to apply in
-baqGOP,--baqGapOpenPenalty <baqGapOpenPenalty> BAQ gap open penalty (Phred Scaled).
Default value is 40. 30 is perhaps
better for whole genome call sets
-fixMisencodedQuals,--fix_misencoded_quality_scores Fix mis-encoded base quality scores
-allowPotentiallyMisencodedQuals,--allow_potentially_misencoded_quality_scores Do not fail when encountering base
qualities that are too high and that
seemingly indicate a problem with the
base quality encoding of the BAM file
-PF,--performanceLog <performanceLog> If provided, a GATK runtime
performance log will be written to
this file
-OQ,--useOriginalQualities If set, use the original base quality
scores from the OQ tag when present
instead of the standard scores
-BQSR,--BQSR <BQSR> The input covariates table file which
enables on-the-fly base quality score
-DIQ,--disable_indel_quals If true, disables printing of base
insertion and base deletion tags (with
-EOQ,--emit_original_quals If true, enables printing of the OQ
tag with the original base qualities
(with -BQSR)
-preserveQ,--preserve_qscores_less_than <preserve_qscores_less_than> Bases with quality scores less than
this threshold won't be recalibrated
(with -BQSR)
-globalQScorePrior,--globalQScorePrior <globalQScorePrior> The global Qscore Bayesian prior to
use in the BQSR. If specified, this
value will be used as the prior for
all mismatch quality scores instead of
the actual reported quality score
-allowBqsrOnReducedBams,--allow_bqsr_on_reduced_bams_despite_repeated_warnings Do not fail when running base quality
score recalibration on a reduced BAM
file even though we highly recommend
against it
-DBQ,--defaultBaseQualities <defaultBaseQualities> If reads are missing some or all base
quality scores, this value will be
used for all base quality scores
-S,--validation_strictness <validation_strictness> How strict should we be with
-rpr,--remove_program_records Should we override the Walker's
default and remove program records
from the SAM header
-kpr,--keep_program_records Should we override the Walker's
default and keep program records from
the SAM header
-U,--unsafe <unsafe> If set, enables unsafe operations:
nothing will be checked at runtime.
For expert users only who know what
they are doing. We do not support
usage of this argument.
-nt,--num_threads <num_threads> How many data threads should be
allocated to running this analysis.
-nct,--num_cpu_threads_per_data_thread <num_cpu_threads_per_data_thread> How many CPU threads should be
allocated per data thread to running
this analysis?
-mte,--monitorThreadEfficiency Enable GATK threading efficiency
-bfh,--num_bam_file_handles <num_bam_file_handles> The total number of BAM file handles
to keep open simultaneously
-rgbl,--read_group_black_list <read_group_black_list> Filters out read groups matching
<TAG>:<STRING> or a .txt file
containing the filter strings one per
-ped,--pedigree <pedigree> Pedigree files for samples
-pedString,--pedigreeString <pedigreeString> Pedigree string for samples
-pedValidationType,--pedigreeValidationType <pedigreeValidationType> How strict should we be in validating
the pedigree information? (STRICT|
-l,--logging_level <logging_level> Set the minimum level of logging, i.e.
setting INFO get's you INFO up to
FATAL, setting ERROR gets you ERROR
and FATAL level logging.
-log,--log_to_file <log_to_file> Set the logging location
-h,--help Generate this help message
-version,--version Output version information
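The usage synopsis and option list above can be turned into a small command builder. The following is a minimal sketch in Python, not part of GATK itself: the function name `build_gatk_command`, the default jar path and the example file names are hypothetical, and it assumes that `-I` and `-L` may each be given more than once, as the "file(s)" and "one or more genomic intervals" wording suggests. Only engine options listed above are used.

```python
import shlex
from typing import List, Optional, Sequence


def build_gatk_command(
    analysis_type: str,
    reference: str,
    inputs: Sequence[str] = (),
    intervals: Sequence[str] = (),
    num_threads: Optional[int] = None,
    logging_level: Optional[str] = None,
    jar: str = "GenomeAnalysisTK.jar",  # assumed location of the jar
) -> List[str]:
    """Assemble an argument list that follows the usage synopsis."""
    cmd = ["java", "-jar", jar, "-T", analysis_type, "-R", reference]
    for bam in inputs:          # one -I per SAM/BAM file
        cmd += ["-I", bam]
    for interval in intervals:  # one -L per interval or interval file
        cmd += ["-L", interval]
    if num_threads is not None:
        cmd += ["-nt", str(num_threads)]
    if logging_level is not None:
        cmd += ["-l", logging_level]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; CountReads is one of the listed analysis types.
    cmd = build_gatk_command(
        "CountReads", "ref.fasta", inputs=["sample.bam"], intervals=["chr20"]
    )
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting issues.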
CheckAlignment Validates consistency of the aligner interface by taking reads already aligned by BWA
in a BAM file, stripping them of their alignment data, realigning them, and making sure
one of the best resulting realignments matches the original alignment from the input
VariantAnnotator Annotates variant calls with context information.
BeagleOutputToVCF Takes files produced by Beagle imputation engine and creates a vcf with modified
ProduceBeagleInput Converts the input VCF into a format accepted by the Beagle imputation/analysis
VariantsToBeagleUnphased Produces an input file to Beagle imputation engine, listing unphased, hard-called
genotypes for a single sample in input variant file.
BaseRecalibrator First pass of the base quality score recalibration -- Generates recalibration table
based on various user-specified covariates (such as read group, reported quality score,
machine cycle, and nucleotide context).
CallableLoci Emits a data file containing information about callable, uncallable, poorly mapped, and
other parts of the genome <p/>
CompareCallableLoci Test routine for new VariantContext object
DepthOfCoverage Toolbox for assessing sequence coverage by a wide array of metrics, partitioned by
sample, read group, or library
GCContentByInterval Walks along reference and calculates the GC content for each interval.
CoveredByNSamplesSites print intervals file with all the variant sites that have "most" ( >= 90% by default)
of the samples with "good" (>= 10 by default)coverage ("most" and "good" can be set in
the command line).
ErrorRatePerCycle Computes the read error rate per position in read (in the original 5'->3' orientation
that the read had coming off the machine) Emits a GATKReport containing readgroup,
cycle, mismatches, counts, qual, and error rate for each read group in the input BAMs
ReadGroupProperties Emits a GATKReport containing read group, sample, library, platform, center, sequencing
data, paired end status, simple read type name (e.g.
ReadLengthDistribution Outputs the read lengths of all the reads in a file.
DiffObjects A generic engine for comparing tree-structured objects
GATKPaperGenotyper A simple Bayesian genotyper, that outputs a text based call format.
FastaAlternateReferenceMaker Generates an alternative reference sequence over the specified interval.
FastaReferenceMaker Renders a new reference in FASTA format consisting of only those loci provided in the
input data set.
FastaStats Calculates basic statistics about the reference sequence itself
VariantFiltration Filters variant calls using a number of user-selectable, parameterizable criteria.
UnifiedGenotyper A variant caller which unifies the approaches of several disparate callers -- Works for
single-sample and multi-sample data.
HaplotypeCaller Call SNPs and indels simultaneously via local de-novo assembly of haplotypes in an
active region.
HaplotypeResolver Haplotype-based resolution of variants in 2 different eval files.
IndelRealigner Performs local realignment of reads based on misalignments due to the presence of
LeftAlignIndels Left-aligns indels from reads in a bam file.
RealignerTargetCreator Emits intervals for the Local Indel Realigner to target for realignment.
PhaseByTransmission Computes the most likely genotype combination and phases trios and parent/child pairs
ReadBackedPhasing Walks along all variant ROD loci, caching a user-defined window of VariantContext
sites, and then finishes phasing them when they go out of range (using upstream and
downstream reads).
CheckPileup At every locus in the input set, compares the pileup data (reference base, aligned base
from each overlapping read, and quality score) to the reference pileup data generated
by samtools.
CountBases Walks over the input data set, calculating the number of bases seen for diagnostic
CountIntervals Counts the number of contiguous regions the walker traverses over.
CountLoci Walks over the input data set, calculating the total number of covered loci for
diagnostic purposes.
CountMales Walks over the input data set, calculating the number of reads seen from male samples
for diagnostic purposes.
CountReadEvents Walks over the input data set, counting the number of read events (from the CIGAR
CountReads Walks over the input data set, calculating the number of reads seen for diagnostic
CountRODs Prints out counts of the number of reference ordered data objects encountered.
CountRODsByRef Prints out counts of the number of reference ordered data objects encountered.
CountTerminusEvent Walks over the input data set, counting the number of reads ending in
insertions/deletions or soft-clips
FlagStat A reimplementation of the 'samtools flagstat' subcommand in the GATK.
Pileup Prints the alignment in something similar to the samtools pileup format.
PrintRODs Prints out all of the RODs in the input data set.
QCRef Quality control for the reference fasta
ReadClippingStats Walks over the input reads, printing out statistics about the read length, number of
clipping events, and length of the clipping to the output stream.
ClipReads This tool provides simple, powerful read clipping capabilities to remove low quality
strings of bases, sections of reads, and reads containing user-provided sequences.
PrintReads Renders, in SAM/BAM format, all reads from the input data set in the order in which
they appear in the input file.
SplitSamFile Divides the input data set into separate BAM files, one for each sample in the input
data set.
CompareBAM Given two BAMs with different read groups, it compares them based on ReduceReads
ReduceReads Reduces the BAM file using read based compression that keeps only essential information
for variant calling
BaseCoverageDistribution Simple walker to plot the coverage distribution per base.
DiagnoseTargets Analyzes coverage distribution and validates read mates for a given interval and
GenotypeAndValidate Genotypes a dataset and validates the calls of another dataset using the Unified
ValidationAmplicons Creates FASTA sequences for use in Seqenom or PCR utilities for site amplification and
subsequent validation
ValidationSiteSelector Randomly selects VCF records according to specified options.
VariantEval General-purpose tool for variant evaluation (% in dbSNP, genotype concordance, Ti/Tv
ratios, and a lot more)
ApplyRecalibration Applies cuts to the input vcf file (by adding filter lines) to achieve the desired
novel truth sensitivity levels which were specified during VariantRecalibration
VariantRecalibrator Create a Gaussian mixture model by looking at the annotations values over a high
quality subset of the input call set and then evaluate all input variants.
CombineVariants Combines VCF records from different sources.
FilterLiftedVariants Filters a lifted-over VCF file for ref bases that have been changed.
GenotypeConcordance A simple walker for performing genotype concordance calculations between two callsets.
LeftAlignVariants Left-aligns indels from a variants file.
LiftoverVariants Lifts a VCF file over from one build to another.
RandomlySplitVariants Takes a VCF file, randomly splits variants into two different sets, and outputs 2 new
VCFs with the results.
RegenotypeVariants Regenotypes the variants from a VCF.
SelectHeaders Selects headers from a VCF source.
SelectVariants Selects variants from a VCF source.
ValidateVariants Validates a VCF file with an extra strict set of criteria.
VariantsToBinaryPed Converts a VCF file to a binary plink Ped file (.bed/.bim/.fam)
VariantsToTable Emits specific fields from a VCF file to a tab-deliminated table
VariantsToVCF Converts variants from other file formats to VCF format.
VariantValidationAssessor Annotates a validation (from Sequenom for example) VCF with QC metrics (HW-equilibrium,
% failed probes)
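Several of the tools listed above are meant to be chained. As an illustration of a predefined tool sequence, the sketch below strings together the two passes of base quality score recalibration: BaseRecalibrator writes a recalibration table, and PrintReads applies it through the engine-level `-BQSR` option. This is a hand-rolled Python stand-in, not Queue. The tool-specific arguments `-knownSites` and `-o` do not appear in the engine help above and are assumptions that should be checked against each tool's own help output; the function names and file names are hypothetical.

```python
import subprocess
from typing import Callable, List

JAR = "GenomeAnalysisTK.jar"  # assumed location of the jar


def bqsr_commands(
    reference: str, bam: str, known_sites: str, table: str, out_bam: str
) -> List[List[str]]:
    """Return the two commands of a recalibration run, in execution order."""
    base = ["java", "-jar", JAR, "-R", reference, "-I", bam]
    # Pass 1: build the recalibration table (-knownSites and -o are assumed).
    first_pass = base + [
        "-T", "BaseRecalibrator", "-knownSites", known_sites, "-o", table,
    ]
    # Pass 2: rewrite the reads with on-the-fly recalibration (-o is assumed).
    second_pass = base + ["-T", "PrintReads", "-BQSR", table, "-o", out_bam]
    return [first_pass, second_pass]


def run_sequence(
    commands: List[List[str]],
    runner: Callable[[List[str]], object] = subprocess.check_call,
) -> None:
    """Run commands one after another; the default runner raises
    CalledProcessError on a non-zero exit, which stops the sequence."""
    for cmd in commands:
        runner(cmd)


if __name__ == "__main__":
    steps = bqsr_commands(
        "ref.fasta", "sample.bam", "dbsnp.vcf", "recal.grp", "sample.recal.bam"
    )
    # Dry run: print each command instead of executing it.
    run_sequence(steps, runner=lambda cmd: print(" ".join(cmd)))
```

Passing a different `runner` keeps the sequence testable without a Java installation; the default executes each step and aborts on the first failure.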