NGS Exercise.5 GATK
Call variations using the GATK Haplotype Caller workflow (not covered today)
Due to license limitations imposed by the Broad Institute, GATK cannot be taught to a mixed audience of VIB and non-VIB scientists. Results presented here are used to validate the training data but will therefore not be generated during the session.
We only consider the BWA mem mapping data here, as this recent algorithm is considered more performant. Similarly, we only apply the GATK HaplotypeCaller, which is becoming the new standard and applies to diploid genomes like the one selected today.
Contents
- 1 Introduction
- 2 Prepare the BWA-mem BAM data for GATK analysis
- 3 Obtain the required gatk bundle reference files
- 4 Improve mapping by local realignment (indels)
- 5 Analyze patterns of covariation in the sequence dataset with gatk BaseRecalibrator
- 6 Compress the sequence data with gatk ReduceReads
- 7 Call raw SNP and Indel variants with gatk HaplotypeCaller
- 8 Recalibrate variant calls for SNPs and Indels
- 9 Filter high quality data for comparison analysis using snpEff
- 10 Perform QC on the filtered VCF calls
- 11 Estimate variant concordance between bcftools/samtools and gatk pipelines
- 12 download exercise files
Introduction
GATK heavily relies on Picard tools for working with different file formats. It is therefore very important that input data be Picard compliant before attempting to use GATK. The first part of the NGS session was dedicated to improving the original hg18 data to make it more Picard-able. This kind of data reshaping will often be required when you want to use GATK. Detailed information is available on the GATK pages to perform this cleaning and inspired the content of several exercises performed today. The Broad GATK wiki and learning platform, as well as excellent video presentations from this summer's workshop, are available online and should be viewed before attempting your own analyses.
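As a minimal sketch of such a compliance check (the BAM file name is an illustrative placeholder), Picard's ValidateSamFile can report formatting problems before GATK is even started:

# summarise formatting problems in a BAM file before feeding it to GATK
# (my_mappings.bam is an illustrative placeholder name)
java -Xmx6g -jar $PICARD/ValidateSamFile.jar \
    I=my_mappings.bam \
    MODE=SUMMARY \
    2>ValidateSamFile.err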
list of gatk main commands
--------------------------------------------------------------------------------
The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51
Copyright (c) 2010 The Broad Institute
For support and documentation go to http://www.broadinstitute.org/gatk
--------------------------------------------------------------------------------
usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-args <arg_file>]
  [-I <input_file>] [-rbs <read_buffer_size>] [-et <phone_home>] [-K <gatk_key>]
  [-tag <tag>] [-rf <read_filter>] [-L <intervals>] [-XL <excludeIntervals>]
  [-isr <interval_set_rule>] [-im <interval_merging>] [-ip <interval_padding>]
  [-R <reference_sequence>] [-ndrs] [-maxRuntime <maxRuntime>]
  [-maxRuntimeUnits <maxRuntimeUnits>] [-dt <downsampling_type>]
  [-dfrac <downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-baq <baq>]
  [-baqGOP <baqGapOpenPenalty>] [-fixMisencodedQuals]
  [-allowPotentiallyMisencodedQuals] [-OQ] [-DBQ <defaultBaseQualities>]
  [-PF <performanceLog>] [-BQSR <BQSR>] [-DIQ] [-EOQ]
  [-preserveQ <preserve_qscores_less_than>] [-globalQScorePrior <globalQScorePrior>]
  [-allowBqsrOnReducedBams] [-S <validation_strictness>] [-rpr] [-kpr]
  [-sample_rename_mapping_file <sample_rename_mapping_file>] [-U <unsafe>]
  [-nt <num_threads>] [-nct <num_cpu_threads_per_data_thread>] [-mte]
  [-bfh <num_bam_file_handles>] [-rgbl <read_group_black_list>] [-ped <pedigree>]
  [-pedString <pedigreeString>] [-pedValidationType <pedigreeValidationType>]
  [-l <logging_level>] [-log <log_to_file>] [-h] [-version]

 -T,--analysis_type <analysis_type>  Type of analysis to run
 -args,--arg_file <arg_file>  Reads arguments from the specified file
 -I,--input_file <input_file>  SAM or BAM file(s)
 -rbs,--read_buffer_size <read_buffer_size>  Number of reads per SAM file to buffer in memory
 -et,--phone_home <phone_home>  What kind of GATK run report should we generate? AWS is the default, can be NO_ET so nothing is posted to the run repository. Please see nd-how-does-it-affect-me#latest for details. (NO_ET|AWS|STDOUT)
 -K,--gatk_key <gatk_key>  GATK Key file. Required if running with -et NO_ET. Please see nd-how-does-it-affect-me#latest for details.
 -tag,--tag <tag>  Arbitrary tag string to identify this GATK run as part of a group of runs, for later analysis
 -rf,--read_filter <read_filter>  Specify filtration criteria to apply to each read individually
 -L,--intervals <intervals>  One or more genomic intervals over which to operate. Can be explicitly specified on the command line or in a file (including a rod file)
 -XL,--excludeIntervals <excludeIntervals>  One or more genomic intervals to exclude from processing. Can be explicitly specified on the command line or in a file (including a rod file)
 -isr,--interval_set_rule <interval_set_rule>  Indicates the set merging approach the interval parser should use to combine the various -L or -XL inputs (UNION|INTERSECTION)
 -im,--interval_merging <interval_merging>  Indicates the interval merging rule we should use for abutting intervals (ALL|OVERLAPPING_ONLY)
 -ip,--interval_padding <interval_padding>  Indicates how many basepairs of padding to include around each of the intervals specified with the -L/--intervals argument
 -R,--reference_sequence <reference_sequence>  Reference sequence file
 -ndrs,--nonDeterministicRandomSeed  Makes the GATK behave non deterministically, that is, the random numbers generated will be different in every run
 -maxRuntime,--maxRuntime <maxRuntime>  If provided, that GATK will stop execution cleanly as soon after maxRuntime has been exceeded, truncating the run but not exiting with a failure. By default the value is interpreted in minutes, but this can be changed by maxRuntimeUnits
 -maxRuntimeUnits,--maxRuntimeUnits <maxRuntimeUnits>  The TimeUnit for maxRuntime (NANOSECONDS|MICROSECONDS|MILLISECONDS|SECONDS|MINUTES|HOURS|DAYS)
 -dt,--downsampling_type <downsampling_type>  Type of reads downsampling to employ at a given locus. Reads will be selected randomly to be removed from the pile based on the method described here (NONE|ALL_READS|BY_SAMPLE)
 -dfrac,--downsample_to_fraction <downsample_to_fraction>  Fraction [0.0-1.0] of reads to downsample to
 -dcov,--downsample_to_coverage <downsample_to_coverage>  Coverage [integer] to downsample to. For locus-based traversals (eg., LocusWalkers and ActiveRegionWalkers), this controls the maximum depth of coverage at each locus. For non-locus-based traversals (eg., ReadWalkers), this controls the maximum number of reads sharing the same alignment start position. Note that this downsampling option does NOT produce an unbiased random sampling from all available reads at each locus: instead, the primary goal of the to-coverage downsampler is to maintain an even representation of reads from all alignment start positions when removing excess coverage. For a true across-the-board unbiased random sampling of reads, use -dfrac instead. Also note that the coverage target is an approximate goal that is not guaranteed to be met exactly: the downsampling algorithm will under some circumstances retain slightly more coverage than requested.
 -baq,--baq <baq>  Type of BAQ calculation to apply in the engine (OFF|CALCULATE_AS_NECESSARY|RECALCULATE)
 -baqGOP,--baqGapOpenPenalty <baqGapOpenPenalty>  BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets
 -fixMisencodedQuals,--fix_misencoded_quality_scores  Fix mis-encoded base quality scores
 -allowPotentiallyMisencodedQuals,--allow_potentially_misencoded_quality_scores  Do not fail when encountering base qualities that are too high and that seemingly indicate a problem with the base quality encoding of the BAM file
 -OQ,--useOriginalQualities  If set, use the original base quality scores from the OQ tag when present instead of the standard scores
 -DBQ,--defaultBaseQualities <defaultBaseQualities>  If reads are missing some or all base quality scores, this value will be used for all base quality scores
 -PF,--performanceLog <performanceLog>  If provided, a GATK runtime performance log will be written to this file
 -BQSR,--BQSR <BQSR>  The input covariates table file which enables on-the-fly base quality score recalibration (intended for use with BaseRecalibrator and PrintReads)
 -DIQ,--disable_indel_quals  If true, disables printing of base insertion and base deletion tags (with -BQSR)
 -EOQ,--emit_original_quals  If true, enables printing of the OQ tag with the original base qualities (with -BQSR)
 -preserveQ,--preserve_qscores_less_than <preserve_qscores_less_than>  Bases with quality scores less than this threshold won't be recalibrated (with -BQSR)
 -globalQScorePrior,--globalQScorePrior <globalQScorePrior>  The global Qscore Bayesian prior to use in the BQSR. If specified, this value will be used as the prior for all mismatch quality scores instead of the actual reported quality score
 -allowBqsrOnReducedBams,--allow_bqsr_on_reduced_bams_despite_repeated_warnings  Do not fail when running base quality score recalibration on a reduced BAM file even though we highly recommend against it
 -S,--validation_strictness <validation_strictness>  How strict should we be with validation (STRICT|LENIENT|SILENT)
 -rpr,--remove_program_records  Should we override the Walker's default and remove program records from the SAM header
 -kpr,--keep_program_records  Should we override the Walker's default and keep program records from the SAM header
 -sample_rename_mapping_file,--sample_rename_mapping_file <sample_rename_mapping_file>  Rename sample IDs on-the-fly at runtime using the provided mapping file. This option requires that each BAM file listed in the mapping file have only a single sample specified in its header (though there may be multiple read groups for that sample). Each line of the mapping file must contain the absolute path to a BAM file, followed by whitespace, followed by the new sample name for that BAM file.
 -U,--unsafe <unsafe>  If set, enables unsafe operations: nothing will be checked at runtime. For expert users only who know what they are doing. We do not support usage of this argument. (ALLOW_N_CIGAR_READS|ALLOW_UNINDEXED_BAM|ALLOW_UNSET_BAM_SORT_ORDER|NO_READ_ORDER_VERIFICATION|ALLOW_SEQ_DICT_INCOMPATIBILITY|LENIENT_VCF_PROCESSING|ALL)
 -nt,--num_threads <num_threads>  How many data threads should be allocated to running this analysis.
 -nct,--num_cpu_threads_per_data_thread <num_cpu_threads_per_data_thread>  How many CPU threads should be allocated per data thread to running this analysis?
 -mte,--monitorThreadEfficiency  Enable GATK threading efficiency monitoring
 -bfh,--num_bam_file_handles <num_bam_file_handles>  The total number of BAM file handles to keep open simultaneously
 -rgbl,--read_group_black_list <read_group_black_list>  Filters out read groups matching <TAG>:<STRING> or a .txt file containing the filter strings one per line.
 -ped,--pedigree <pedigree>  Pedigree files for samples
 -pedString,--pedigreeString <pedigreeString>  Pedigree string for samples
 -pedValidationType,--pedigreeValidationType <pedigreeValidationType>  How strict should we be in validating the pedigree information? (STRICT|SILENT)
 -l,--logging_level <logging_level>  Set the minimum level of logging, i.e. setting INFO gets you INFO up to FATAL, setting ERROR gets you ERROR and FATAL level logging.
 -log,--log_to_file <log_to_file>  Set the logging location
 -h,--help  Generate this help message
 -version,--version  Output version information

alignment
  CheckAlignment  Validates consistency of the aligner interface
annotator
  VariantAnnotator  Annotates variant calls with context information.
beagle
  BeagleOutputToVCF  Takes files produced by Beagle imputation engine and creates a vcf with modified annotations.
  ProduceBeagleInput  Converts the input VCF into a format accepted by the Beagle imputation/analysis program.
  VariantsToBeagleUnphased  Produces an input file to Beagle imputation engine, listing unphased, hard-called genotypes for a single sample in input variant file.
bqsr
  AnalyzeCovariates  Tool to analyze and evaluate base recalibration tables.
  BaseRecalibrator  First pass of the base quality score recalibration -- Generates recalibration table based on various user-specified covariates (such as read group, reported quality score, machine cycle, and nucleotide context).
  RecalibrationPerformance  Evaluate the performance of the base recalibration process
coverage
  CallableLoci  Emits a data file containing information about callable, uncallable, poorly mapped, and other parts of the genome
  CompareCallableLoci  Test routine for new VariantContext object
  DepthOfCoverage  Toolbox for assessing sequence coverage by a wide array of metrics, partitioned by sample, read group, or library
  GCContentByInterval  Walks along reference and calculates the GC content for each interval.
diagnosetargets
  DiagnoseTargets  Analyzes coverage distribution and validates read mates for a given interval and sample.
diagnostics
  BaseCoverageDistribution  Simple walker to plot the coverage distribution per base
  CoveredByNSamplesSites  Print intervals file with all the variant sites for which most of the samples have good coverage
  ErrorRatePerCycle  Computes the read error rate per position in read (in the original 5'->3' orientation that the read had coming off the machine) Emits a GATKReport containing readgroup, cycle, mismatches, counts, qual, and error rate for each read group in the input BAMs FOR ONLY THE FIRST OF PAIR READS.
  FindCoveredIntervals  Outputs a list of intervals that are covered above a given threshold.
  ReadGroupProperties  Emits a GATKReport containing read group, sample, library, platform, center, sequencing data, paired end status, simple read type name (e.g.
  ReadLengthDistribution  Outputs the read lengths of all the reads in a file.
diffengine
  DiffObjects  A generic engine for comparing tree-structured objects
examples
  GATKPaperGenotyper  A simple Bayesian genotyper, that outputs a text based call format.
fasta
  FastaAlternateReferenceMaker  Generates an alternative reference sequence over the specified interval.
  FastaReferenceMaker  Renders a new reference in FASTA format consisting of only those loci provided in the input data set.
  FastaStats  Calculate basic statistics about the reference sequence itself
filters
  VariantFiltration  Filters variant calls using a number of user-selectable, parameterizable criteria.
genotyper
  UnifiedGenotyper  A variant caller which unifies the approaches of several disparate callers -- Works for single-sample and multi-sample data.
haplotypecaller
  HaplotypeCaller  Call SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region.
  HaplotypeResolver  Haplotype-based resolution of variants in 2 different eval files.
indels
  IndelRealigner  Performs local realignment of reads to correct misalignments due to the presence of indels.
  LeftAlignIndels  Left-aligns indels from reads in a bam file.
  RealignerTargetCreator  Emits intervals for the Local Indel Realigner to target for realignment.
missing
  QualifyMissingIntervals  Walks along reference and calculates a few metrics for each interval.
phasing
  PhaseByTransmission  Computes the most likely genotype combination and phases trios and parent/child pairs
  ReadBackedPhasing  Walks along all variant ROD loci, caching a user-defined window of VariantContext sites, and then finishes phasing them when they go out of range (using upstream and downstream reads).
qc
  CheckPileup  At every locus in the input set, compares the pileup data (reference base, aligned base from each overlapping read, and quality score) to the reference pileup data generated by samtools.
  CountBases  Walks over the input data set, calculating the number of bases seen for diagnostic purposes.
  CountIntervals  Count contiguous regions in an interval list.
  CountLoci  Walks over the input data set, calculating the total number of covered loci for diagnostic purposes.
  CountMales  Walks over the input data set, calculating the number of reads seen from male samples for diagnostic purposes.
  CountReadEvents  Walks over the input data set, counting the number of read events (from the CIGAR operator)
  CountReads  Walks over the input data set, calculating the number of reads seen for diagnostic purposes.
  CountRODs  Prints out counts of the number of reference ordered data objects encountered.
  CountRODsByRef  Prints out counts of the number of reference ordered data objects encountered along the reference.
  CountTerminusEvent  Walks over the input data set, counting the number of reads ending in insertions/deletions or soft-clips
  FlagStat  A reimplementation of the 'samtools flagstat' subcommand in the GATK
  Pileup  Emulates the samtools pileup command to print aligned reads
  PrintRODs  Prints out all of the RODs in the input data set.
  QCRef  Quality control for the reference fasta
  ReadClippingStats  Walks over the input reads, printing out statistics about the read length, number of clipping events, and length of the clipping to the output stream.
readutils
  ClipReads  This tool provides simple, powerful read clipping capabilities to remove low quality strings of bases, sections of reads, and reads containing user-provided sequences.
  PrintReads  Renders, in SAM/BAM format, all reads from the input data set in the order in which they appear in the input file.
  ReadAdaptorTrimmer  Utility tool to blindly strip base adaptors.
  SplitSamFile  Divides the input data set into separate BAM files, one for each sample in the input data set.
reducereads
  CompareBAM  Given two BAMs with different read groups, it compares them based on ReduceReads metrics.
  ReduceReads  Reduces the BAM file using read based compression that keeps only essential information for variant calling
validation
  GenotypeAndValidate  Genotypes a dataset and validates the calls of another dataset using the Unified Genotyper.
  ValidationAmplicons  Creates FASTA sequences for use in Seqenom or PCR utilities for site amplification and subsequent validation
validationsiteselector
  ValidationSiteSelector  Randomly selects VCF records according to specified options.
varianteval
  VariantEval  General-purpose tool for variant evaluation (% in dbSNP, genotype concordance, Ti/Tv ratios, and a lot more)
variantrecalibration
  ApplyRecalibration  Applies cuts to the input vcf file (by adding filter lines) to achieve the desired novel truth sensitivity levels which were specified during VariantRecalibration
  VariantRecalibrator  Create a Gaussian mixture model by looking at the annotations values over a high quality subset of the input call set and then evaluate all input variants.
variantutils
  CombineVariants  Combines VCF records from different sources.
  FilterLiftedVariants  Filters a lifted-over VCF file for ref bases that have been changed.
  GenotypeConcordance  Genotype concordance (per-sample and aggregate counts and frequencies, NRD/NRS and site allele overlaps) between two callsets
  LeftAlignAndTrimVariants  Left-aligns indels from a variants file.
  LiftoverVariants  Lifts a VCF file over from one build to another.
  RandomlySplitVariants  Takes a VCF file, randomly splits variants into two different sets, and outputs 2 new VCFs with the results.
  RegenotypeVariants  Regenotypes the variants from a VCF.
  SelectHeaders  Selects headers from a VCF source.
  SelectVariants  Selects variants from a VCF source.
  ValidateVariants  Validates a VCF file with an extra strict set of criteria.
  VariantsToAllelicPrimitives  Takes alleles from a variants file and breaks them up (if possible) into more basic/primitive alleles.
  VariantsToBinaryPed  Converts a VCF file to a binary plink Ped file (.bed/.bim/.fam)
  VariantsToTable  Emits specific fields from a VCF file to a tab-delimited table
  VariantsToVCF  Converts variants from other file formats to VCF format.
  VariantValidationAssessor  Annotates a validation (from Sequenom for example) VCF with QC metrics (HW-equilibrium, % failed probes)
Links:
- GATK online resources (register with academic credentials for download access)[1]
- BroadE: 2013 GATK Workshop Videos on YouTube (subscribe to the channel)[2]
Prepare the BWA-mem BAM data for GATK analysis
This part of the tutorial was developed from a GATK wiki page dedicated to this specific issue[3]
The steps performed below were already part of other exercises and are repeated here for completeness. If the full training was already performed, users can safely jump to the later part of this document (section apply Picard BuildBamIndex.jar).
apply Picard SortSam.jar
Freshly mapped reads from the BWA mem method are first sorted by read name using Picard.
infolder=Final_Results/hg19_bwa-mapping
infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe.bam
outfile=nmsrt_NA18507-mem.bam
outfolder=gatk_variants
mkdir -p ${outfolder}

# sort mappings by name to keep paired reads together
java -Xmx6g -jar $PICARD/SortSam.jar \
    I=${infolder}/${infile} \
    O=${outfolder}/${outfile} \
    SO=queryname \
    VALIDATION_STRINGENCY=LENIENT \
    2>${infolder}/SortSam_queryname.err
apply Picard FixMateInformation.jar
We now fix mate information and sort by coordinate for the next step.
infolder=gatk_variants
infile=nmsrt_NA18507-mem.bam
outfile=fxmt_nmsrt_NA18507-mem.bam

java -Xmx6g -jar $PICARD/FixMateInformation.jar \
    I=${infolder}/${infile} \
    O=${infolder}/${outfile} \
    SO=coordinate \
    VALIDATION_STRINGENCY=LENIENT \
    2>${infolder}/FixMateInformation.err
apply Picard MarkDuplicates.jar
Marking duplicate reads avoids counting such reads several times when calling variants, which would lead to excessive coverage depth at variable nucleotides.
infolder=gatk_variants
infile=fxmt_nmsrt_NA18507-mem.bam
outfile=mdup_fxmt_nmsrt_NA18507-mem.bam

java -Xmx6g -jar $PICARD/MarkDuplicates.jar \
    I=${infolder}/${infile} \
    O=${infolder}/${outfile} \
    M=${infolder}/duplicate_metrics.txt \
    VALIDATION_STRINGENCY=LENIENT \
    2>${infolder}/MarkDuplicates.err

# review MarkDuplicates.jar summary counts reported in the 10 first lines
head -10 ${infolder}/duplicate_metrics.txt | grep -A1 "^LIBRARY" | transpose -t | column -t
duplicate metrics
LIBRARY                       lib-NA18507
UNPAIRED_READS_EXAMINED       17268
READ_PAIRS_EXAMINED           7477196
UNMAPPED_READS                18534
UNPAIRED_READ_DUPLICATES      4725
READ_PAIR_DUPLICATES          5428
READ_PAIR_OPTICAL_DUPLICATES  1310
PERCENT_DUPLICATION           0.001041
ESTIMATED_LIBRARY_SIZE        6783431981
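As a quick sanity check, PERCENT_DUPLICATION can be recomputed from the counts above, keeping in mind that each duplicate read pair contributes two duplicate reads:

# (UNPAIRED_READ_DUPLICATES + 2*READ_PAIR_DUPLICATES) /
# (UNPAIRED_READS_EXAMINED + 2*READ_PAIRS_EXAMINED)
echo "scale=6; (4725 + 2*5428) / (17268 + 2*7477196)" | bc
# => .001040, matching the reported 0.001041 up to rounding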
apply Picard AddOrReplaceReadGroups.jar
The presence of at least one @RG line is a prerequisite for GATK to work.
infolder=gatk_variants
infile=mdup_fxmt_nmsrt_NA18507-mem.bam
outfile=rg_mdup_fxmt_nmsrt_NA18507-mem.bam

# ID:NA18507
# LB:lib-NA18507
# PU:unkn-0.0
# PL:ILLUMINA
# SM:GAIIx-chr21-BWA.mem
java -Xmx6g -jar $PICARD/AddOrReplaceReadGroups.jar \
    I=${infolder}/${infile} \
    O=${infolder}/${outfile} \
    RGID="NA18507" \
    RGLB="lib-NA18507" \
    RGPL="ILLUMINA" \
    RGPU="unkn-0.0" \
    RGSM="GAIIx-chr21-BWA.mem" \
    VALIDATION_STRINGENCY=LENIENT \
    2>${infolder}/AddOrReplaceReadGroups.err
apply Picard BuildBamIndex.jar
BAM files need to be indexed for fast random-access queries.
infolder=gatk_variants
infile=rg_mdup_fxmt_nmsrt_NA18507-mem.bam

java -Xmx6g -jar $PICARD/BuildBamIndex.jar \
    I=${infolder}/${infile} \
    2>${infolder}/BuildBamIndex.err
Obtain the required gatk bundle reference files
Download the Resource Bundle: The GATK resource bundle is a collection of standard files for working with human resequencing data with the GATK. Please see this page for further details on the content of this resource bundle. The bundle is available for download from the GSA public FTP server (location: ftp.broadinstitute.org username: gsapubftp-anonymous password: <leave blank>).
The following script downloads what we need today from the bundle. Beware that file names and paths may change in the future.
# ftp settings
ftpusr=gsapubftp-anonymous
bundleurl=ftp.broadinstitute.org
bundlepath=/bundle/2.5/hg19

# local settings
bundle=bundle_2.5
mkdir -p ${bundle}

# create a list of files to download
LIST=$(cat <<'END_HEREDOC'
Mills_and_1000G_gold_standard.indels.hg19.vcf.gz
1000G_phase1.snps.high_confidence.hg19.vcf.gz
dbsnp_137.hg19.vcf.gz
hapmap_3.3.hg19.vcf.gz
1000G_omni2.5.hg19.vcf.gz
END_HEREDOC
)

# download each LIST file
for file in ${LIST}; do
    # download if absent
    [ -f "${bundle}/${file}" ] || \
        wget -np -P ${bundle} --ftp-user=${ftpusr} ftp://${bundleurl}${bundlepath}/${file}
    # convert gzip to bgzip to enable tabix indexing
    gunzip ${bundle}/${file}
    bgzip ${bundle}/${file%%.gz}
    # index with tabix
    tabix -p vcf ${bundle}/${file}
done

# create a list of reference files to download
LIST2=$(cat <<'END_HEREDOC'
ucsc.hg19.fasta.gz
ucsc.hg19.fasta.fai.gz
ucsc.hg19.dict.gz
END_HEREDOC
)

# download each LIST2 file
for file in ${LIST2}; do
    # download if absent
    [ -f "${bundle}/${file}" ] || \
        wget -np -P ${bundle} --ftp-user=${ftpusr} ftp://${bundleurl}${bundlepath}/${file}
    gunzip ${bundle}/${file}
done
Alternatively, use the following script to download the whole bundle (~375GB!)
#! /usr/bin/env bash
## script: 'wget-bundle.sh'
## ©SP-BITS, 2013 v1.0
# get all proper bundle files for version 2.5 at once
# ~375GB in total, this will take SOME time!!!

ver=2.5
local=$BIODATA/bundle_${ver}
mkdir -p ${local}

# loop over the distant paths and mirror each folder
for folder in exampleFASTA hg18 hg19 b36 b37; do
    echo "downloading data for ${folder}"
    mkdir -p ${local}/${folder}
    cd ${local}/${folder}
    wget --ftp-user=gsapubftp-anonymous \
        --ftp-password="" \
        --retr-symlinks \
        -S -nd -np \
        --reject ".*,*.md5*,*.out" \
        -r ftp://ftp.broadinstitute.org/bundle/${ver}/${folder}
    echo "# done for ${folder}"
done

# also get Liftover_Chain_Files
lcf=Liftover_Chain_Files
mkdir -p ${local}/${lcf}
cd ${local}/${lcf}
echo "downloading data for ${lcf}"
wget --ftp-user=gsapubftp-anonymous \
    --ftp-password="" \
    --retr-symlinks \
    -S -nd -np \
    --reject ".*,*.md5*,*.out" \
    -r ftp://ftp.broadinstitute.org/${lcf}
echo "# done for ${lcf}"
Run the next code block to define the different bundle file paths; repeat this in each new terminal window used for the remainder of this workflow.
# define local variables
bundle=bundle_2.5

# INDEL gold standards
mills=${bundle}/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz
phase1indel=${bundle}/1000G_phase1.indels.hg19.vcf.gz

# pick one of the former two as the gold_indels set
gold_indels=${mills}

# SNP gold standards
dbsnp=${bundle}/dbsnp_137.hg19.vcf.gz
hapmap=${bundle}/hapmap_3.3.hg19.vcf.gz
phase1snp=${bundle}/1000G_phase1.snps.high_confidence.hg19.vcf.gz
omni=${bundle}/1000G_omni2.5.hg19.vcf.gz
Improve mapping by local realignment (indels)
identify regions for indel local realignment of the selected chromosome with gatk RealignerTargetCreator
reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=rg_mdup_fxmt_nmsrt_NA18507-mem.bam

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R ${reference} \
    -I ${infolder}/${infile} \
    -L chr21 \
    -known ${gold_indels} \
    -o ${infolder}/target_intervals.list \
    2>${infolder}/RealignerTargetCreator.err
target_intervals.list preview
chr21:9466976-9466977
chr21:9467731-9467732
chr21:9475550
chr21:9475796-9475802
chr21:9476537-9476549
chr21:9477336-9477337
chr21:9477436-9477439
chr21:9477710
chr21:9477945-9477956
chr21:9478355-9478399
...
perform indel local realignment of the selected chromosome with gatk IndelRealigner
reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=rg_mdup_fxmt_nmsrt_NA18507-mem.bam

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R ${reference} \
    -I ${infolder}/${infile} \
    -L chr21 \
    -targetIntervals ${infolder}/target_intervals.list \
    -known ${gold_indels} \
    -o ${infolder}/realigned_reads.bam \
    2>${infolder}/IndelRealigner.err
Analyze patterns of covariation in the sequence dataset with gatk BaseRecalibrator
reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=realigned_reads.bam

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R ${reference} \
    -I ${infolder}/${infile} \
    -L chr21 \
    -knownSites ${dbsnp} \
    -knownSites ${gold_indels} \
    -o ${infolder}/recal_data.table \
    2>${infolder}/BaseRecalibrator.err
do a second pass to analyze covariation remaining after recalibration with gatk BaseRecalibrator
Use the same BAM data as in the first iteration
reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=realigned_reads.bam

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R ${reference} \
    -I ${infolder}/${infile} \
    -L chr21 \
    -knownSites ${dbsnp} \
    -knownSites ${gold_indels} \
    -BQSR ${infolder}/recal_data.table \
    -o ${infolder}/post_recal_data.table \
    2>${infolder}/BaseRecalibrator_2.err
generate before/after plots with gatk AnalyzeCovariates
reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T AnalyzeCovariates \
    -R ${reference} \
    -L chr21 \
    -before ${infolder}/recal_data.table \
    -after ${infolder}/post_recal_data.table \
    -plots ${infolder}/recalibration_plots.pdf \
    2>${infolder}/AnalyzeCovariates.err
(figure: AnalyzeCovariates recalibration plots)
apply the recalibration to the sequence data with gatk PrintReads
reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=realigned_reads.bam

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T PrintReads \
    -R ${reference} \
    -L chr21 \
    -I ${infolder}/${infile} \
    -BQSR ${infolder}/recal_data.table \
    -o ${infolder}/recal_reads.bam \
    2>${infolder}/PrintReads.err
Compress the sequence data with gatk ReduceReads
The BAM file obtained in the last step is about twice as big as the initial BAM. In this step, the size is brought back to a fraction of the initial size (about half). This compression may not be compatible with all GATK tools and older analysis workflows; please check with the Broad support team.
reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=recal_reads.bam

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R ${reference} \
    -I ${infolder}/${infile} \
    -L chr21 \
    -o ${infolder}/reduced_reads.bam \
    2>${infolder}/ReduceReads.err
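To verify the size reduction discussed above, a simple check (assuming both BAM files were kept in place):

# compare BAM sizes before and after read reduction
ls -lh gatk_variants/recal_reads.bam gatk_variants/reduced_reads.bam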
Call raw SNP and Indel variants with gatk HaplotypeCaller
For a diploid genome, gatk HaplotypeCaller is advised.
We apply here the arbitrary confidence thresholds of 10 (emitting) and 30 (calling) to obtain good quality variants.

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=reduced_reads.bam

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I ${infolder}/${infile} \
    -L chr21 \
    --genotyping_mode DISCOVERY \
    -stand_emit_conf 10 \
    -stand_call_conf 30 \
    -o ${infolder}/raw_variants.vcf \
    2>${infolder}/HaplotypeCaller.err
Raw variants contain a number of borderline calls and false positives. Removing these is an art and has long been based on empirical cutoff filtering. Recent advances in data analysis (Broad Institute) have led to a more statistical approach involving variant recalibration based on prior knowledge of known variants.
Recalibrate variant calls for SNPs and Indels
A number of Broad pages and documents are dedicated to variant recalibration; please read http://www.broadinstitute.org/gatk/guide/article?id=39[4] and the included links. Please also view the great videos linked from that same page.
During this step, additional annotations are added to the data that allow later filtering and selection of subsets. Some of the available annotations are listed here.
- Build the SNP recalibration model
- additional annotations can be taken from sources such as:
- BaseQualityRankSumTest (BaseQRankSum)
- DepthOfCoverage (DP)
- FisherStrand (FS)
- HaplotypeScore (HaplotypeScore)
- MappingQualityRankSumTest (MQRankSum)
- MappingQualityZero (MQ0)
- QualByDepth (QD)
- ReadPositionRankSumTest (ReadPosRankSum)
- RMSMappingQuality (MQ)
- SnpEff: add genomic annotations using the third-party tool SnpEff with VariantAnnotator (see the sketch after this list)
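As a hedged sketch of the last item (not run during this session, and following the GATK documentation for VariantAnnotator), snpEff is first run in its GATK-compatible output mode, after which VariantAnnotator imports the predicted effects; the 'hg19' snpEff database name and the file names below are illustrative:

# 1) predict effects with snpEff in GATK-compatible output mode
#    (the hg19 database name and file names are illustrative)
java -Xmx6g -jar /opt/biotools/snpEff/snpEff.jar \
    eff -v -o gatk hg19 gatk_variants/raw_variants.vcf \
    > gatk_variants/snpeff_output.vcf

# 2) import the SnpEff predictions as variant annotations
reference=bundle_2.5/ucsc.hg19.fasta
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R ${reference} \
    -A SnpEff \
    --variant gatk_variants/raw_variants.vcf \
    --snpEffFile gatk_variants/snpeff_output.vcf \
    -L gatk_variants/raw_variants.vcf \
    -o gatk_variants/annotated_variants.vcf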
build the SNP recalibration model with gatk VariantRecalibrator
reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=raw_variants.vcf

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R ${reference} \
    -input ${infolder}/${infile} \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${hapmap} \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 ${omni} \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 ${phase1snp} \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${dbsnp} \
    -an DP \
    -an QD \
    -an FS \
    -an MQRankSum \
    -an ReadPosRankSum \
    -mode SNP \
    -tranche 100.0 \
    -tranche 99.9 \
    -tranche 99.0 \
    -tranche 90.0 \
    -numBad 1000 \
    -recalFile ${infolder}/recalibrate_SNP.recal \
    -tranchesFile ${infolder}/recalibrate_SNP.tranches \
    -rscriptFile ${infolder}/recalibrate_SNP_plots.R \
    2>${infolder}/VariantRecalibrator_snp.err
Plots are generated by R for SNP tranches
(figure: SNP tranche recalibration plots)
apply SNP recalibration
reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=raw_variants.vcf

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R ${reference} \
    -input ${infolder}/${infile} \
    -mode SNP \
    --ts_filter_level 99.0 \
    -recalFile ${infolder}/recalibrate_SNP.recal \
    -tranchesFile ${infolder}/recalibrate_SNP.tranches \
    -o ${infolder}/recalibrated_snps_raw_indels.vcf \
    2>${infolder}/ApplyRecalibration_snp.err
Detailed plots are provided for each step of the recalibration. We only reproduce here page #1 of 10 similar panels.
(figure: recalibration plots, page 1 of 10)
build the Indel recalibration model with gatk VariantRecalibrator
reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=recalibrated_snps_raw_indels.vcf

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R ${reference} \
    -input ${infolder}/${infile} \
    -resource:mills,known=true,training=true,truth=true,prior=12.0 ${mills} \
    -an DP \
    -an FS \
    -an MQRankSum \
    -an ReadPosRankSum \
    -mode INDEL \
    -tranche 100.0 \
    -tranche 99.9 \
    -tranche 99.0 \
    -tranche 90.0 \
    -numBad 1000 \
    --maxGaussians 4 \
    -recalFile ${infolder}/recalibrate_INDEL.recal \
    -tranchesFile ${infolder}/recalibrate_INDEL.tranches \
    -rscriptFile ${infolder}/recalibrate_INDEL_plots.R \
    2>${infolder}/VariantRecalibrator_indels.err
Plots are generated by R for INDEL tranches
(figure: INDEL tranche recalibration plots)
apply INDEL recalibration
reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=recalibrated_snps_raw_indels.vcf

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R ${reference} \
    -input ${infolder}/${infile} \
    -mode INDEL \
    --ts_filter_level 99.0 \
    -recalFile ${infolder}/recalibrate_INDEL.recal \
    -tranchesFile ${infolder}/recalibrate_INDEL.tranches \
    -o ${infolder}/recalibrated_variants.vcf \
    2>${infolder}/ApplyRecalibration_indels.err
Again, plots are generated by R
(figure: recalibration plots)
The content of the file can be summarized based on the FILTER column.
infolder=gatk_variants
infile=recalibrated_variants.vcf

grep -v "^#" ${infolder}/${infile} | cut -f 7 | sort | uniq -c
    992 LowQual
  78399 PASS
   1323 VQSRTrancheINDEL99.00to99.90
    355 VQSRTrancheINDEL99.90to100.00
   7541 VQSRTrancheSNP99.00to99.90
   3023 VQSRTrancheSNP99.90to100.00

# compress and index the vcf file
bgzip -c ${infolder}/${infile} > ${infolder}/${infile}.gz && \
    tabix -p vcf ${infolder}/${infile}.gz
Filter high quality data for comparison analysis using snpEff
The double-recalibrated file contains a FILTER field (column #7) that can be used to extract a high-quality subset.
snpEff is a very powerful tool that can do much more than what is presented below. Please refer to its documentation to find out more.
overview of snpEff commands
snpEff version SnpEff 3.3h (build 2013-08-20), by Pablo Cingolani
Usage: snpEff [command] [options] [files]

Available commands:
  [eff]          : Calculate effect of variants. Default: eff (no command or 'eff').
  build          : Build a SnpEff database.
  buildNextProt  : Build a SnpEff for NextProt (using NextProt's XML files).
  cds            : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
  closest        : Annotate the closest genomic region.
  count          : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval.
  databases      : Show currently available databases (from local config file).
  download       : Download a SnpEff database.
  dump           : Dump to STDOUT a SnpEff database (mostly used for debugging).
  genes2bed      : Create a bed file from a genes list.
  len            : Calculate total genomic length for each marker type.
  protein        : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
  spliceAnalysis : Perform an analysis of splice sites. Experimental feature.

Run 'java -jar snpEff.jar command' for help on each specific command
overview of SnpSift commands
SnpSift version SnpSift 3.3h (build 2013-08-20), by Pablo Cingolani
Usage: java -jar SnpSift.jar [command] params...

Command is one of:
  alleleMat     : Create an allele matrix output.
  annotate      : Annotate 'ID' from a database (e.g. dbSnp). Assumes entries are sorted.
  annMem        : Annotate 'ID' from a database (e.g. dbSnp). Loads db in memory. Does not assume sorted entries.
  caseControl   : Compare how many variants are in 'case' and in 'control' groups; calculate p-values.
  ccs           : Case control summary. Case and control summaries by region, allele frequency and variant's functional effect.
  concordance   : Concordance metrics between two VCF files.
  covMat        : Create an covariance matrix output (allele matrix as input).
  dbnsfp        : Annotate with multiple entries from dbNSFP. <EXPERIMENTAL>
  extractFields : Extract fields from VCF file into tab separated format.
  filter        : Filter using arbitrary expressions
  geneSets      : Annotate using MSigDb gene sets (MSigDb includes: GO, KEGG, Reactome, BioCarta, etc.)
  gt            : Add Genotype to INFO fields and remove genotype fields when possible.
  gwasCat       : Annotate using GWAS catalog
  hwe           : Calculate Hardy-Weimberg parameters and perform a godness of fit test.
  intersect     : Intersect intervals (genomic regions).
  intervals     : Keep variants that intersect with intervals.
  intIdx        : Keep variants that intersect with intervals. Index-based method: Used for large VCF file and a few intervals to retrieve
  join          : Join files by genomic region.
  phastCons     : Annotate using conservation scores (phastCons).
  private       : Annotate if a variant is private to a family or group.
  rmRefGen      : Remove reference genotypes.
  rmInfo        : Remove INFO fields.
  sift          : Annotate using SIFT scores from a VCF file.
  split         : Split VCF by chromosome.
  tstv          : Calculate transiton to transversion ratio.
  varType       : Annotate variant type (SNP,MNP,INS,DEL or MIXED).
  vcf2tped      : Convert VCF to TPED.
infolder=gatk_variants
infile=recalibrated_variants.vcf

# equivalent to filter "( ! FILTER='LowQual' )"
cat ${infolder}/${infile} | \
    java -Xmx6g -jar /opt/biotools/snpEff/SnpSift.jar \
    filter "( na FILTER ) | (FILTER = 'PASS' | FILTER =~ 'VQSRT')" | \
    bgzip -c > ${infolder}/hq_recalibrated_variants.vcf.gz && \
    tabix -p vcf ${infolder}/hq_recalibrated_variants.vcf.gz
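A quick record count on the filtered file should confirm that only the LowQual calls were removed; from the FILTER counts above, 91633 total minus 992 LowQual = 90641 records are expected:

# count non-header records in the high-quality subset
zcat ${infolder}/hq_recalibrated_variants.vcf.gz | grep -vc "^#"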
Perform QC on the filtered VCF calls
Two tools can be used, part of VCFtools and HTSlib respectively.
The older command vcf-stats will generate text files that can be consulted for a given metric.
command arguments
Usage: vcf-stats [OPTIONS] file.vcf.gz
Options:
   -d, --dump <file>                 Take an existing dump file and recreate the files (works with -p)
   -f, --filters <filter1,filter2>   List of filters such as column/field (any value), column/field=bin:max (cluster in bins), column/field=value (exact value)
   -p, --prefix <dir/string>         Prefix of output files. If slashes are present, directories will be created.
   -s, --samples <list>              Process only the listed samples, - for none. Excluding unwanted samples may increase performance considerably.
   -h, -?, --help                    This help message.
Examples:
   # Calculate stats separately for the filter field, quality and non-indels
   vcf-stats file.vcf.gz -f FILTER,QUAL=10:200,INFO/INDEL=False -p out/
   # Calculate stats for all samples
   vcf-stats file.vcf.gz -f FORMAT/DP=10:200 -p out/
   # Calculate stats only for the sample NA00001
   vcf-stats file.vcf.gz -f SAMPLE/NA00001/DP=1:200 -p out/
   vcf-stats file.vcf.gz > perl.dump
infolder=gatk_variants

infile=recalibrated_variants.vcf.gz
vcf-stats ${infolder}/${infile} -p ${infolder}/vcf_stats_${infile%%.vcf.gz}/

infile=hq_recalibrated_variants.vcf.gz
vcf-stats ${infolder}/${infile} -p ${infolder}/vcf_stats_${infile%%.vcf.gz}/
list of text files resulting from such a command
-rw-rw-r-- 1 bits bits  58 Nov 8 15:35 vcf_stats_recalibrated_variants.vcf.gz.counts
-rw-rw-r-- 1 bits bits 20K Nov 8 15:35 vcf_stats_recalibrated_variants.vcf.gz.dump
-rw-rw-r-- 1 bits bits  58 Nov 8 15:35 vcf_stats_recalibrated_variants.vcf.gz.indels
-rw-rw-r-- 1 bits bits 316 Nov 8 15:35 vcf_stats_recalibrated_variants.vcf.gz.legend
-rw-rw-r-- 1 bits bits  47 Nov 8 15:35 vcf_stats_recalibrated_variants.vcf.gz.private
-rw-rw-r-- 1 bits bits  39 Nov 8 15:35 vcf_stats_recalibrated_variants.vcf.gz.qual-tstv
-rw-rw-r-- 1 bits bits  77 Nov 8 15:35 vcf_stats_recalibrated_variants.vcf.gz.samples-tstv
-rw-rw-r-- 1 bits bits  31 Nov 8 15:35 vcf_stats_recalibrated_variants.vcf.gz.shared
-rw-rw-r-- 1 bits bits  58 Nov 8 15:35 vcf_stats_recalibrated_variants.vcf.gz.snps
-rw-rw-r-- 1 bits bits  61 Nov 8 15:35 vcf_stats_recalibrated_variants.vcf.gz.tstv
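Each of these small text files holds one metric and can simply be printed; e.g. for the ts/tv values (the exact path depends on the -p prefix used above, hence the glob):

# display the ts/tv summary files produced by vcf-stats
cat gatk_variants/vcf_stats_recalibrated_variants*/*tstv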
The current version of HTSlib produces nicer output using htscmd vcfcheck and plot-vcfcheck, as already seen in the BITS training.
Note that the 'htscmd' package has since been moved to 'bcftools_0.2': the stat-generating command is now named 'bcftools stats' and the new plotting command is 'plot-vcfstats'.
infolder=gatk_variants

# apply htscmd vcfcheck to the full recalibrated data
infile=recalibrated_variants.vcf.gz
htscmd vcfcheck ${infolder}/${infile} > \
    ${infolder}/${infile%%.vcf.gz}.check
plot-vcfcheck -p ${infolder}/${infile%%.vcf.gz}.plots/ \
    ${infolder}/${infile%%.vcf.gz}.check

# apply htscmd vcfcheck to the HQ recalibrated data
infile=hq_recalibrated_variants.vcf.gz
htscmd vcfcheck ${infolder}/${infile} > \
    ${infolder}/${infile%%.vcf.gz}.check
plot-vcfcheck -p ${infolder}/${infile%%.vcf.gz}.plots/ \
    ${infolder}/${infile%%.vcf.gz}.check
This becomes with the new versions:
infolder=gatk_variants

# apply bcftools stats to the full recalibrated data
infile=recalibrated_variants.vcf.gz
bcftools_0.2.0 stats ${infolder}/${infile} > \
    ${infolder}/${infile%%.vcf.gz}.check
plot-vcfstats_0.2.0 -p ${infolder}/${infile%%.vcf.gz}.plots/ \
    ${infolder}/${infile%%.vcf.gz}.check
# ...
(figure: vcfcheck plots for all GATK calls)
(figure: vcfcheck plots for the high-quality subset)
The ts/tv ratio goes from 1.9 for all calls to 2.0 for the HQ subset; both values are indicative of good-quality calls.
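The ratio itself can be pulled from the .check files without regenerating the plots (a simple grep, kept case-insensitive to be robust to the exact section layout of the stats output):

# extract the ts/tv lines from both stats files
grep -i "ts/tv" gatk_variants/recalibrated_variants.check
grep -i "ts/tv" gatk_variants/hq_recalibrated_variants.check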
Estimate variant concordance between bcftools/samtools and gatk pipelines
The BWA mem mappings were used to call variants with the GATK HaplotypeCaller (NGS_Exercise.5_GATK) and are compared here to the bcftools calls obtained earlier for the same BAM data.
concordance between bcftools and gatk calls on BWA mem
The full GATK list and the higher confidence subset are compared to the filtered bcftools list of calls.
bcffile=bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz
gatkfile=gatk_variants/recalibrated_variants.vcf.gz
gatkhqfile=gatk_variants/hq_recalibrated_variants.vcf.gz
outfolder=vcf-venn/bcftools-gatk
mkdir -p ${outfolder}

# triple comparison
vcf-compare ${bcffile} \
    ${gatkfile} \
    ${gatkhqfile} | \
    grep "^VN" > ${outfolder}/compare-3_pos.cmp
comparison results
VN 230   bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (0.3%)  gatk_variants/recalibrated_variants.vcf.gz (0.3%)
VN 762   gatk_variants/recalibrated_variants.vcf.gz (0.8%)
VN 3960  bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (4.7%)
VN 10032 gatk_variants/hq_recalibrated_variants.vcf.gz (11.1%)  gatk_variants/recalibrated_variants.vcf.gz (10.9%)
VN 80609 bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (95.1%)  gatk_variants/hq_recalibrated_variants.vcf.gz (88.9%)  gatk_variants/recalibrated_variants.vcf.gz (88.0%)
GATK calls 10032 additional variants compared to the filtered bcftools list, which in turn shows 3960 private variants.
concordance between bcftools and gatk calls on BWA mem and CASAVA calls
This comparison can be related to measuring the sensitivity.
bcffile=bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz
gatkfile=gatk_variants/recalibrated_variants.vcf.gz
gatkhqfile=gatk_variants/hq_recalibrated_variants.vcf.gz
casava=public_info/Illumina_BaseSpace/chr21_SNPs_NA18507_CASAVA-1.8_hg19.recode.vcf.gz
outfolder=vcf-venn/bcftools-gatk
mkdir -p ${outfolder}

# 4-way comparison
vcf-compare ${bcffile} \
    ${gatkfile} \
    ${gatkhqfile} \
    ${casava} | grep "^VN" > ${outfolder}/compare-4casava_pos.cmp
comparison results
concordance between bcftools and gatk calls on BWA mem and the HapMap 3.3 gold standard
This comparison can be related to measuring the specificity.
The Hapmap set was filtered to keep only calls consistent with parent haplotypes and is limited to SNPs.
bcffile=bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz
gatkfile=gatk_variants/recalibrated_variants.vcf.gz
gatkhqfile=gatk_variants/hq_recalibrated_variants.vcf.gz
goldlist=public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz
outfolder=vcf-venn/bcftools-gatk
mkdir -p ${outfolder}

# 4-way comparison
vcf-compare ${bcffile} \
    ${gatkfile} \
    ${gatkhqfile} \
    ${goldlist} | grep "^VN" > ${outfolder}/compare-4gold_pos.cmp
comparison results
download exercise files
Download exercise files here
References:
- ↑ http://www.broadinstitute.org/gatk/guide/
- ↑ http://www.youtube.com/playlist?list=PLlMMtlgw6qNjSDAnDIDurM5g_C73hK4ya
- ↑ http://gatkforums.broadinstitute.org/discussion/2909/howto-fix-a-badly-formatted-bam
- ↑ http://www.broadinstitute.org/gatk/guide/article?id=39