NGS Exercise.5 GATK



Call variants using the GATK Haplotype Caller workflow (not covered today)


[workflow figures: GATK_workflow.png (GATK workflow) and ex05b_wf.png (this exercise)]

Due to license limitations set by the Broad Institute, the GATK cannot be taught to a mixed audience of VIB and non-VIB scientists. The results presented here were used to validate the training data but will therefore not be generated during the session.

Handicon.png We only consider the BWA mem mapping data here, as this recent algorithm is considered to perform better. Similarly, we only apply the GATK Haplotype Caller, which is becoming the new standard and applies to diploid genomes like the one selected today.

Technical.png Results presented here were obtained with Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11


Introduction


GATK relies heavily on Picard tools for working with different file formats. It is therefore very important that input data be Picard-compliant before attempting to use GATK. The first part of the NGS session was dedicated to improving the original hg18 data and making it more 'Picard-able'; this kind of data reshaping will often be required when you want to use GATK. Detailed information on performing this cleaning is available on the GATK pages and inspired several of the exercises performed today. The Broad GATK wiki and learning platform, as well as the excellent video presentations from this summer's workshop, are available online and should be viewed before attempting your own analyses.
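
As a quick compliance check, you can run Picard's ValidateSamFile on a BAM before feeding it to GATK. A minimal sketch, assuming the Picard jars live under $PICARD as elsewhere in this document:

# validate a BAM file before running GATK (SUMMARY mode aggregates error counts)
java -Xmx6g -jar $PICARD/ValidateSamFile.jar \
	I=Final_Results/hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe.bam \
	MODE=SUMMARY \
	2>ValidateSamFile.err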

list of GATK main commands

--------------------------------------------------------------------------------
The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51
Copyright (c) 2010 The Broad Institute
For support and documentation go to http://www.broadinstitute.org/gatk
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-args <arg_file>] [-I <input_file>] [-rbs <read_buffer_size>] [-et 
       <phone_home>] [-K <gatk_key>] [-tag <tag>] [-rf <read_filter>] [-L <intervals>] [-XL <excludeIntervals>] [-isr 
       <interval_set_rule>] [-im <interval_merging>] [-ip <interval_padding>] [-R <reference_sequence>] [-ndrs] [-maxRuntime 
       <maxRuntime>] [-maxRuntimeUnits <maxRuntimeUnits>] [-dt <downsampling_type>] [-dfrac <downsample_to_fraction>] [-dcov 
       <downsample_to_coverage>] [-baq <baq>] [-baqGOP <baqGapOpenPenalty>] [-fixMisencodedQuals] 
       [-allowPotentiallyMisencodedQuals] [-OQ] [-DBQ <defaultBaseQualities>] [-PF <performanceLog>] [-BQSR <BQSR>] [-DIQ] 
       [-EOQ] [-preserveQ <preserve_qscores_less_than>] [-globalQScorePrior <globalQScorePrior>] [-allowBqsrOnReducedBams] [-S 
       <validation_strictness>] [-rpr] [-kpr] [-sample_rename_mapping_file <sample_rename_mapping_file>] [-U <unsafe>] [-nt 
       <num_threads>] [-nct <num_cpu_threads_per_data_thread>] [-mte] [-bfh <num_bam_file_handles>] [-rgbl 
       <read_group_black_list>] [-ped <pedigree>] [-pedString <pedigreeString>] [-pedValidationType <pedigreeValidationType>] 
       [-l <logging_level>] [-log <log_to_file>] [-h] [-version]
 
 -T,--analysis_type <analysis_type>                                                      Type of analysis to run
 -args,--arg_file <arg_file>                                                             Reads arguments from the 
                                                                                         specified file
 -I,--input_file <input_file>                                                            SAM or BAM file(s)
 -rbs,--read_buffer_size <read_buffer_size>                                              Number of reads per SAM file to 
                                                                                         buffer in memory
 -et,--phone_home <phone_home>                                                           What kind of GATK run report 
                                                                                         should we generate? AWS is the 
                                                                                         default, can be NO_ET so 
                                                                                         nothing is posted to the run 
                                                                                         repository. Please see 
                                                                                         nd-how-does-it-affect-me#latest 
                                                                                         for details. (NO_ET|AWS|STDOUT)
 -K,--gatk_key <gatk_key>                                                                GATK Key file. Required if 
                                                                                         running with -et NO_ET. Please 
                                                                                         see 
                                                                                         nd-how-does-it-affect-me#latest 
                                                                                         for details.
 -tag,--tag <tag>                                                                        Arbitrary tag string to 
                                                                                         identify this GATK run as part 
                                                                                         of a group of runs, for later 
                                                                                         analysis
 -rf,--read_filter <read_filter>                                                         Specify filtration criteria to 
                                                                                         apply to each read individually
 -L,--intervals <intervals>                                                              One or more genomic intervals 
                                                                                         over which to operate. Can be 
                                                                                         explicitly specified on the 
                                                                                         command line or in a file 
                                                                                         (including a rod file)
 -XL,--excludeIntervals <excludeIntervals>                                               One or more genomic intervals 
                                                                                         to exclude from processing. Can 
                                                                                         be explicitly specified on the 
                                                                                         command line or in a file 
                                                                                         (including a rod file)
 -isr,--interval_set_rule <interval_set_rule>                                            Indicates the set merging 
                                                                                         approach the interval parser 
                                                                                         should use to combine the 
                                                                                         various -L or -XL inputs (UNION|
                                                                                         INTERSECTION)
 -im,--interval_merging <interval_merging>                                               Indicates the interval merging 
                                                                                         rule we should use for abutting 
                                                                                         intervals (ALL|
                                                                                         OVERLAPPING_ONLY)
 -ip,--interval_padding <interval_padding>                                               Indicates how many basepairs of 
                                                                                         padding to include around each 
                                                                                         of the intervals specified with 
                                                                                         the -L/--intervals argument
 -R,--reference_sequence <reference_sequence>                                            Reference sequence file
 -ndrs,--nonDeterministicRandomSeed                                                      Makes the GATK behave non 
                                                                                         deterministically, that is, the 
                                                                                         random numbers generated will 
                                                                                         be different in every run
 -maxRuntime,--maxRuntime <maxRuntime>                                                   If provided, that GATK will 
                                                                                         stop execution cleanly as soon 
                                                                                         after maxRuntime has been 
                                                                                         exceeded, truncating the run 
                                                                                         but not exiting with a failure. 
                                                                                          By default the value is 
                                                                                         interpreted in minutes, but 
                                                                                         this can be changed by 
                                                                                         maxRuntimeUnits
 -maxRuntimeUnits,--maxRuntimeUnits <maxRuntimeUnits>                                    The TimeUnit for maxRuntime 
                                                                                         (NANOSECONDS|MICROSECONDS|
                                                                                         MILLISECONDS|SECONDS|MINUTES|
                                                                                         HOURS|DAYS)
 -dt,--downsampling_type <downsampling_type>                                             Type of reads downsampling to 
                                                                                         employ at a given locus.  Reads 
                                                                                         will be selected randomly to be 
                                                                                         removed from the pile based on 
                                                                                         the method described here (NONE|
                                                                                         ALL_READS|BY_SAMPLE)
 -dfrac,--downsample_to_fraction <downsample_to_fraction>                                Fraction [0.0-1.0] of reads to 
                                                                                         downsample to
 -dcov,--downsample_to_coverage <downsample_to_coverage>                                 Coverage [integer] to 
                                                                                         downsample to. For locus-based 
                                                                                         traversals (eg., LocusWalkers 
                                                                                         and ActiveRegionWalkers),this 
                                                                                         controls the maximum depth of 
                                                                                         coverage at each locus. For 
                                                                                         non-locus-based traversals 
                                                                                         (eg., ReadWalkers), this 
                                                                                         controls the maximum number of 
                                                                                         reads sharing the same 
                                                                                         alignment start position. Note 
                                                                                         that this downsampling option 
                                                                                         does NOT produce an unbiased 
                                                                                         random sampling from all 
                                                                                         available reads at each locus: 
                                                                                         instead, the primary goal of 
                                                                                         the to-coverage downsampler is 
                                                                                         to maintain an even 
                                                                                         representation of reads from 
                                                                                         all alignment start positions 
                                                                                         when removing excess coverage. 
                                                                                         For a true across-the-board 
                                                                                         unbiased random sampling of 
                                                                                         reads, use -dfrac instead. Also 
                                                                                         note that the coverage target 
                                                                                         is an approximate goal that is 
                                                                                         not guaranteed to be met 
                                                                                         exactly: the downsampling 
                                                                                         algorithm will under some 
                                                                                         circumstances retain slightly 
                                                                                         more coverage than requested.
 -baq,--baq <baq>                                                                        Type of BAQ calculation to 
                                                                                         apply in the engine (OFF|
                                                                                         CALCULATE_AS_NECESSARY|
                                                                                         RECALCULATE)
 -baqGOP,--baqGapOpenPenalty <baqGapOpenPenalty>                                         BAQ gap open penalty (Phred 
                                                                                         Scaled).  Default value is 40. 
                                                                                          30 is perhaps better for whole 
                                                                                         genome call sets
 -fixMisencodedQuals,--fix_misencoded_quality_scores                                     Fix mis-encoded base quality 
                                                                                         scores
 -allowPotentiallyMisencodedQuals,--allow_potentially_misencoded_quality_scores          Do not fail when encountering 
                                                                                         base qualities that are too 
                                                                                         high and that seemingly 
                                                                                         indicate a problem with the 
                                                                                         base quality encoding of the 
                                                                                         BAM file
 -OQ,--useOriginalQualities                                                              If set, use the original base 
                                                                                         quality scores from the OQ tag 
                                                                                         when present instead of the 
                                                                                         standard scores
 -DBQ,--defaultBaseQualities <defaultBaseQualities>                                      If reads are missing some or 
                                                                                         all base quality scores, this 
                                                                                         value will be used for all base 
                                                                                         quality scores
 -PF,--performanceLog <performanceLog>                                                   If provided, a GATK runtime 
                                                                                         performance log will be written 
                                                                                         to this file
 -BQSR,--BQSR <BQSR>                                                                     The input covariates table file 
                                                                                         which enables on-the-fly base 
                                                                                         quality score recalibration 
                                                                                         (intended for use with 
                                                                                         BaseRecalibrator and 
                                                                                         PrintReads)
 -DIQ,--disable_indel_quals                                                              If true, disables printing of 
                                                                                         base insertion and base 
                                                                                         deletion tags (with -BQSR)
 -EOQ,--emit_original_quals                                                              If true, enables printing of 
                                                                                         the OQ tag with the original 
                                                                                         base qualities (with -BQSR)
 -preserveQ,--preserve_qscores_less_than <preserve_qscores_less_than>                    Bases with quality scores less 
                                                                                         than this threshold won't be 
                                                                                         recalibrated (with -BQSR)
 -globalQScorePrior,--globalQScorePrior <globalQScorePrior>                              The global Qscore Bayesian 
                                                                                         prior to use in the BQSR. If 
                                                                                         specified, this value will be 
                                                                                         used as the prior for all 
                                                                                         mismatch quality scores instead 
                                                                                         of the actual reported quality 
                                                                                         score
 -allowBqsrOnReducedBams,--allow_bqsr_on_reduced_bams_despite_repeated_warnings          Do not fail when running base 
                                                                                         quality score recalibration on 
                                                                                         a reduced BAM file even though 
                                                                                         we highly recommend against it
 -S,--validation_strictness <validation_strictness>                                      How strict should we be with 
                                                                                         validation (STRICT|LENIENT|
                                                                                         SILENT)
 -rpr,--remove_program_records                                                           Should we override the Walker's 
                                                                                         default and remove program 
                                                                                         records from the SAM header
 -kpr,--keep_program_records                                                             Should we override the Walker's 
                                                                                         default and keep program 
                                                                                         records from the SAM header
 -sample_rename_mapping_file,--sample_rename_mapping_file <sample_rename_mapping_file>   Rename sample IDs on-the-fly at 
                                                                                         runtime using the provided 
                                                                                         mapping file. This option 
                                                                                         requires that each BAM file 
                                                                                         listed in the mapping file have 
                                                                                         only a single sample specified 
                                                                                         in its header (though there may 
                                                                                         be multiple read groups for 
                                                                                         that sample). Each line of the 
                                                                                         mapping file must contain the 
                                                                                         absolute path to a BAM file, 
                                                                                         followed by whitespace, 
                                                                                         followed by the new sample name 
                                                                                         for that BAM file.
 -U,--unsafe <unsafe>                                                                    If set, enables unsafe 
                                                                                         operations: nothing will be 
                                                                                         checked at runtime.  For expert 
                                                                                         users only who know what they 
                                                                                         are doing.  We do not support 
                                                                                         usage of this argument. 
                                                                                         (ALLOW_N_CIGAR_READS|
                                                                                         ALLOW_UNINDEXED_BAM|
                                                                                         ALLOW_UNSET_BAM_SORT_ORDER|
                                                                                         NO_READ_ORDER_VERIFICATION|
                                                                                         ALLOW_SEQ_DICT_INCOMPATIBILITY|
                                                                                         LENIENT_VCF_PROCESSING|ALL)
 -nt,--num_threads <num_threads>                                                         How many data threads should be 
                                                                                         allocated to running this 
                                                                                         analysis.
 -nct,--num_cpu_threads_per_data_thread <num_cpu_threads_per_data_thread>                How many CPU threads should be 
                                                                                         allocated per data thread to 
                                                                                         running this analysis?
 -mte,--monitorThreadEfficiency                                                          Enable GATK threading 
                                                                                         efficiency monitoring
 -bfh,--num_bam_file_handles <num_bam_file_handles>                                      The total number of BAM file 
                                                                                         handles to keep open 
                                                                                         simultaneously
 -rgbl,--read_group_black_list <read_group_black_list>                                   Filters out read groups 
                                                                                         matching <TAG>:<STRING> or a 
                                                                                         .txt file containing the filter 
                                                                                         strings one per line.
 -ped,--pedigree <pedigree>                                                              Pedigree files for samples
 -pedString,--pedigreeString <pedigreeString>                                            Pedigree string for samples
 -pedValidationType,--pedigreeValidationType <pedigreeValidationType>                    How strict should we be in 
                                                                                         validating the pedigree 
                                                                                         information? (STRICT|SILENT)
 -l,--logging_level <logging_level>                                                      Set the minimum level of 
                                                                                         logging, i.e. setting INFO 
                                                                                         get's you INFO up to FATAL, 
                                                                                         setting ERROR gets you ERROR 
                                                                                         and FATAL level logging.
 -log,--log_to_file <log_to_file>                                                        Set the logging location
 -h,--help                                                                               Generate this help message
 -version,--version                                                                      Output version information
 
 alignment                       
   CheckAlignment                Validates consistency of the aligner interface
 
 annotator                       
   VariantAnnotator              Annotates variant calls with context information.
 
 beagle                          
   BeagleOutputToVCF             Takes files produced by Beagle imputation engine and creates a vcf with modified 
                                 annotations.
   ProduceBeagleInput            Converts the input VCF into a format accepted by the Beagle imputation/analysis 
                                 program.
   VariantsToBeagleUnphased      Produces an input file to Beagle imputation engine, listing unphased, hard-called 
                                 genotypes for a single sample in input variant file.
 
 bqsr                            
    AnalyzeCovariates             Tool to analyze and evaluate base recalibration tables.
   BaseRecalibrator              First pass of the base quality score recalibration -- Generates recalibration table 
                                 based on various user-specified covariates (such as read group, reported quality score, 
                                 machine cycle, and nucleotide context).
   RecalibrationPerformance      Evaluate the performance of the base recalibration process
 
 coverage                        
   CallableLoci                  Emits a data file containing information about callable, uncallable, poorly mapped, and 
                                 other parts of the genome <p/>
   CompareCallableLoci           Test routine for new VariantContext object
   DepthOfCoverage               Toolbox for assessing sequence coverage by a wide array of metrics, partitioned by 
                                 sample, read group, or library
   GCContentByInterval           Walks along reference and calculates the GC content for each interval.
 
 diagnosetargets                 
   DiagnoseTargets               Analyzes coverage distribution and validates read mates for a given interval and 
                                 sample.
 
 diagnostics                     
   BaseCoverageDistribution      Simple walker to plot the coverage distribution per base
   CoveredByNSamplesSites        Print intervals file with all the variant sites for which most of the samples have good 
                                 coverage
   ErrorRatePerCycle             Computes the read error rate per position in read (in the original 5'->3' orientation 
                                 that the read had coming off the machine)  Emits a GATKReport containing readgroup, 
                                 cycle, mismatches, counts, qual, and error rate for each read group in the input BAMs 
                                 FOR ONLY THE FIRST OF PAIR READS.
   FindCoveredIntervals          Outputs a list of intervals that are covered above a given threshold.
   ReadGroupProperties           Emits a GATKReport containing read group, sample, library, platform, center, sequencing 
                                 data, paired end status, simple read type name (e.g.
   ReadLengthDistribution        Outputs the read lengths of all the reads in a file.
 
 diffengine                      
   DiffObjects                   A generic engine for comparing tree-structured objects
 
 examples                        
   GATKPaperGenotyper            A simple Bayesian genotyper, that outputs a text based call format.
 
 fasta                           
   FastaAlternateReferenceMaker  Generates an alternative reference sequence over the specified interval.
   FastaReferenceMaker           Renders a new reference in FASTA format consisting of only those loci provided in the 
                                 input data set.
   FastaStats                    Calculate basic statistics about the reference sequence itself
 
 filters                         
   VariantFiltration             Filters variant calls using a number of user-selectable, parameterizable criteria.
 
 genotyper                       
   UnifiedGenotyper              A variant caller which unifies the approaches of several disparate callers -- Works for 
                                 single-sample and multi-sample data.
 
 haplotypecaller                 
   HaplotypeCaller               Call SNPs and indels simultaneously via local de-novo assembly of haplotypes in an 
                                 active region.
   HaplotypeResolver             Haplotype-based resolution of variants in 2 different eval files.
 
 indels                          
   IndelRealigner                Performs local realignment of reads to correct misalignments due to the presence of 
                                 indels.
   LeftAlignIndels               Left-aligns indels from reads in a bam file.
   RealignerTargetCreator        Emits intervals for the Local Indel Realigner to target for realignment.
 
 missing                         
   QualifyMissingIntervals       Walks along reference and calculates a few metrics for each interval.
 
 phasing                         
   PhaseByTransmission           Computes the most likely genotype combination and phases trios and parent/child pairs
   ReadBackedPhasing             Walks along all variant ROD loci, caching a user-defined window of VariantContext 
                                 sites, and then finishes phasing them when they go out of range (using upstream and 
                                 downstream reads).
 
 qc                              
   CheckPileup                   At every locus in the input set, compares the pileup data (reference base, aligned base 
                                 from each overlapping read, and quality score) to the reference pileup data generated 
                                 by samtools.
   CountBases                    Walks over the input data set, calculating the number of bases seen for diagnostic 
                                 purposes.
   CountIntervals                Count contiguous regions in an interval list.
   CountLoci                     Walks over the input data set, calculating the total number of covered loci for 
                                 diagnostic purposes.
   CountMales                    Walks over the input data set, calculating the number of reads seen from male samples 
                                 for diagnostic purposes.
   CountReadEvents               Walks over the input data set, counting the number of read events (from the CIGAR 
                                 operator)
   CountReads                    Walks over the input data set, calculating the number of reads seen for diagnostic 
                                 purposes.
   CountRODs                     Prints out counts of the number of reference ordered data objects encountered.
   CountRODsByRef                Prints out counts of the number of reference ordered data objects encountered along the 
                                 reference.
   CountTerminusEvent            Walks over the input data set, counting the number of reads ending in 
                                 insertions/deletions or soft-clips
   FlagStat                      A reimplementation of the 'samtools flagstat' subcommand in the GATK
   Pileup                        Emulates the samtools pileup command to print aligned reads
   PrintRODs                     Prints out all of the RODs in the input data set.
   QCRef                         Quality control for the reference fasta
   ReadClippingStats             Walks over the input reads, printing out statistics about the read length, number of 
                                 clipping events, and length of the clipping to the output stream.
 
 readutils                       
   ClipReads                     This tool provides simple, powerful read clipping capabilities to remove low quality 
                                 strings of bases, sections of reads, and reads containing user-provided sequences.
   PrintReads                    Renders, in SAM/BAM format, all reads from the input data set in the order in which 
                                 they appear in the input file.
   ReadAdaptorTrimmer            Utility tool to blindly strip base adaptors.
   SplitSamFile                  Divides the input data set into separate BAM files, one for each sample in the input 
                                 data set.
 
 reducereads                     
   CompareBAM                    Given two BAMs with different read groups, it compares them based on ReduceReads 
                                 metrics.
   ReduceReads                   Reduces the BAM file using read based compression that keeps only essential information 
                                 for variant calling
 
 validation                      
   GenotypeAndValidate           Genotypes a dataset and validates the calls of another dataset using the Unified 
                                 Genotyper.
   ValidationAmplicons           Creates FASTA sequences for use in Seqenom or PCR utilities for site amplification and 
                                 subsequent validation
 
 validationsiteselector          
   ValidationSiteSelector        Randomly selects VCF records according to specified options.
 
 varianteval                     
   VariantEval                   General-purpose tool for variant evaluation (% in dbSNP, genotype concordance, Ti/Tv 
                                 ratios, and a lot more)
 
 variantrecalibration            
   ApplyRecalibration            Applies cuts to the input vcf file (by adding filter lines) to achieve the desired 
                                 novel truth sensitivity levels which were specified during VariantRecalibration
   VariantRecalibrator           Create a Gaussian mixture model by looking at the annotations values over a high 
                                 quality subset of the input call set and then evaluate all input variants.
 
 variantutils                    
   CombineVariants               Combines VCF records from different sources.
   FilterLiftedVariants          Filters a lifted-over VCF file for ref bases that have been changed.
   GenotypeConcordance           Genotype concordance (per-sample and aggregate counts and frequencies, NRD/NRS and site 
                                 allele overlaps) between two callsets
   LeftAlignAndTrimVariants      Left-aligns indels from a variants file.
   LiftoverVariants              Lifts a VCF file over from one build to another.
   RandomlySplitVariants         Takes a VCF file, randomly splits variants into two different sets, and outputs 2 new 
                                 VCFs with the results.
   RegenotypeVariants            Regenotypes the variants from a VCF.
   SelectHeaders                 Selects headers from a VCF source.
   SelectVariants                Selects variants from a VCF source.
   ValidateVariants              Validates a VCF file with an extra strict set of criteria.
   VariantsToAllelicPrimitives   Takes alleles from a variants file and breaks them up (if possible) into more 
                                 basic/primitive alleles.
   VariantsToBinaryPed           Converts a VCF file to a binary plink Ped file (.bed/.bim/.fam)
   VariantsToTable               Emits specific fields from a VCF file to a tab-deliminated table
   VariantsToVCF                 Converts variants from other file formats to VCF format.
   VariantValidationAssessor     Annotates a validation (from Sequenom for example) VCF with QC metrics (HW-equilibrium, 
                                 % failed probes)


Prepare the BWA-mem BAM data for GATK analysis

This part of the tutorial was developed from a GATK wiki page dedicated to this specific issue [3]

The steps performed below were already part of other exercises and are repeated here for completeness. If the full training was already performed, you can safely jump to the later section 'apply Picard BuildBamIndex.jar'.

apply Picard SortSam.jar

Freshly mapped reads from the BWA mem method are first sorted by read name using Picard.

infolder=Final_Results/hg19_bwa-mapping
infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe.bam
outfile=nmsrt_NA18507-mem.bam
 
outfolder=gatk_variants
mkdir -p ${outfolder}
 
# sort mappings by name to keep paired reads together
java -Xmx6g -jar $PICARD/SortSam.jar \
	I=${infolder}/${infile} \
	O=${outfolder}/${outfile} \
	SO=queryname \
	VALIDATION_STRINGENCY=LENIENT \
	2>${infolder}/SortSam_queryname.err

apply Picard FixMateInformation.jar

We now fix mate information and sort by coordinate for the next step.

infolder=gatk_variants
infile=nmsrt_NA18507-mem.bam
outfile=fxmt_nmsrt_NA18507-mem.bam
 
java -Xmx6g -jar $PICARD/FixMateInformation.jar \
	I=${infolder}/${infile} \
	O=${infolder}/${outfile} \
	SO=coordinate \
	VALIDATION_STRINGENCY=LENIENT \
	2>${infolder}/FixMateInformation.err

apply Picard MarkDuplicates.jar

Marking duplicate reads avoids counting such reads several times when calling variants, which would lead to excessive coverage depth at variable nucleotides.

infolder=gatk_variants
infile=fxmt_nmsrt_NA18507-mem.bam
outfile=mdup_fxmt_nmsrt_NA18507-mem.bam
 
java -Xmx6g -jar $PICARD/MarkDuplicates.jar \
	I=${infolder}/${infile} \
	O=${infolder}/${outfile} \
	M=${infolder}/duplicate_metrics.txt \
	VALIDATION_STRINGENCY=LENIENT \
	2>${infolder}/MarkDuplicates.err
 
# review MarkDuplicates.jar summary counts reported in the first 10 lines
head -10 ${infolder}/duplicate_metrics.txt | grep -A1 "^LIBRARY"  | transpose -t | column -t

duplicate metrics

LIBRARY                       lib-NA18507
UNPAIRED_READS_EXAMINED       17268
READ_PAIRS_EXAMINED           7477196
UNMAPPED_READS                18534
UNPAIRED_READ_DUPLICATES      4725
READ_PAIR_DUPLICATES          5428
READ_PAIR_OPTICAL_DUPLICATES  1310
PERCENT_DUPLICATION           0.001041
ESTIMATED_LIBRARY_SIZE        6783431981
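
As a sanity check, PERCENT_DUPLICATION can be recomputed from the counts above; Picard derives it as the fraction of examined reads that are duplicates, with read pairs counted twice:

# PERCENT_DUPLICATION = (UNPAIRED_READ_DUPLICATES + 2*READ_PAIR_DUPLICATES)
#                     / (UNPAIRED_READS_EXAMINED  + 2*READ_PAIRS_EXAMINED)
echo "scale=6; (4725 + 2*5428) / (17268 + 2*7477196)" | bc
# => .001040, i.e. the reported 0.001041 after rounding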

apply Picard AddOrReplaceReadGroups.jar

The presence of at least one @RG line is a prerequisite for GATK to work.

infolder=gatk_variants
infile=mdup_fxmt_nmsrt_NA18507-mem.bam
outfile=rg_mdup_fxmt_nmsrt_NA18507-mem.bam
 
# ID=NA18507
# LB=lib-NA18507
# PU=unkn-0.0
# PL=ILLUMINA
# SM=GAIIx-chr21-BWA.mem
java -Xmx6g -jar $PICARD/AddOrReplaceReadGroups.jar \
	I=${infolder}/${infile} \
	O=${infolder}/${outfile} \
	RGID="NA18507" \
	RGLB="lib-NA18507" \
	RGPL="ILLUMINA" \
	RGPU="unkn-0.0" \
	RGSM="GAIIx-chr21-BWA.mem" \
	VALIDATION_STRINGENCY=LENIENT \
	2>${infolder}/AddOrReplaceReadGroups.err

apply Picard BuildBamIndex.jar

BAM files need to be indexed for fast random-access queries

infolder=gatk_variants
infile=rg_mdup_fxmt_nmsrt_NA18507-mem.bam 
 
java -Xmx6g -jar $PICARD/BuildBamIndex.jar \
	I=${infolder}/${infile} \
	2>${infolder}/BuildBamIndex.err

Obtain the required gatk bundle reference files

Download the resource bundle: the GATK resource bundle is a collection of standard files for working with human resequencing data with the GATK. Please see the GATK bundle page for further details on its content. The bundle is available for download from the GSA public FTP server (location: ftp.broadinstitute.org, username: gsapubftp-anonymous, password: <leave blank>).

Handicon.png The following script would download what we need today from the bundle. Beware that file names and paths may change in the future

Technical.png The download was done for you so no need to run this code during the training session

# ftp settings
ftpusr=gsapubftp-anonymous
bundleurl=ftp.broadinstitute.org
bundlepath=/bundle/2.5/hg19
 
# local settings
bundle=bundle_2.5
mkdir -p ${bundle}
 
# create a list of files to download
LIST=$(cat <<'END_HEREDOC'
Mills_and_1000G_gold_standard.indels.hg19.vcf.gz
1000G_phase1.indels.hg19.vcf.gz
1000G_phase1.snps.high_confidence.hg19.vcf.gz
dbsnp_137.hg19.vcf.gz
hapmap_3.3.hg19.vcf.gz
1000G_omni2.5.hg19.vcf.gz
END_HEREDOC
)
 
# download each LIST file
for file in ${LIST}; do
# download if absent
[ -f "${bundle}/${file}" ] || wget -np -P ${bundle} --ftp-user=${ftpusr} ftp://${bundleurl}${bundlepath}/${file}
# convert gzip to bgzip to enable tabix indexing
gunzip ${bundle}/${file}
bgzip ${bundle}/${file%%.gz}
# index with tabix
tabix -p vcf ${bundle}/${file}
done
 
# create a list of reference files to download
LIST2=$(cat <<'END_HEREDOC'
ucsc.hg19.fasta.gz
ucsc.hg19.fasta.fai.gz
ucsc.hg19.dict.gz
END_HEREDOC
)
 
# download each LIST2 file 
for file in ${LIST2}; do
# download if absent
[ -f "${bundle}/${file}" ] || wget -np -P ${bundle} --ftp-user=${ftpusr} ftp://${bundleurl}${bundlepath}/${file}
gunzip ${bundle}/${file}
done

alternatively, use the following script to download the whole bundle (>375 GB!)

#! /usr/bin/env bash
## script: 'wget-bundle.sh'
## ©SP-BITS, 2013 v1.0
 
# get all proper bundle files for version 2.5 at once
# ~375GB in total, this will take SOME time!!!
 
ver=2.5
local=$BIODATA/bundle_${ver}
mkdir -p ${local}
 
# loop in distant path and mirror all
for folder in exampleFASTA hg18 hg19 b36 b37; do
 
        echo "downloading data for "${folder}
        mkdir -p ${local}/${folder}
        cd ${local}/${folder}
        wget --ftp-user=gsapubftp-anonymous \
                --ftp-password="" \
                --retr-symlinks \
                -S -nd -np \
                --reject ".*,*.md5*,*.out" \
                -r ftp://ftp.broadinstitute.org/bundle/${ver}/${folder}
 
        echo "# done for ${folder}"
done
 
# also get Liftover_Chain_Files
lcf=Liftover_Chain_Files
mkdir -p ${local}/${lcf}
cd ${local}/${lcf}
 
echo "downloading data for "${lcf}
wget --ftp-user=gsapubftp-anonymous \
                --ftp-password="" \
                --retr-symlinks \
                -S -nd -np \
                --reject ".*,*.md5*,*.out" \
                -r ftp://ftp.broadinstitute.org/${lcf}
 
echo "# done for ${lcf}"

Handicon.png You still need to expand the fasta reference file(s) and their indexes, as GATK does not support gz-compressed reference files.
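
A minimal sketch of that expansion, assuming the hg19 files were mirrored into $BIODATA/bundle_2.5 by the script above:

# GATK needs the plain fasta, .fai index and .dict, not their .gz versions
cd $BIODATA/bundle_2.5/hg19
gunzip ucsc.hg19.fasta.gz ucsc.hg19.fasta.fai.gz ucsc.hg19.dict.gz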

Handicon.png Run the next code block to define the different bundle file paths; do this in each new terminal window used for the remainder of this workflow

# define local variable
bundle=bundle_2.5
 
# INDEL gold standards
mills=${bundle}/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz
phase1indel=${bundle}/1000G_phase1.indels.hg19.vcf.gz
 
# pick one of the former two as gold_indels set
gold_indels=${mills}
 
# SNP gold standards
dbsnp=${bundle}/dbsnp_137.hg19.vcf.gz
hapmap=${bundle}/hapmap_3.3.hg19.vcf.gz
phase1snp=${bundle}/1000G_phase1.snps.high_confidence.hg19.vcf.gz
omni=${bundle}/1000G_omni2.5.hg19.vcf.gz

Improve mapping by local realignment (indels)

identify regions for indel local realignment of the selected chromosome with gatk RealignerTargetCreator

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=rg_mdup_fxmt_nmsrt_NA18507-mem.bam 
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T RealignerTargetCreator \
	-R ${reference} \
	-I ${infolder}/${infile} \
	-L chr21 \
	-known ${gold_indels} \
	-o ${infolder}/target_intervals.list \
	2>${infolder}/RealignerTargetCreator.err

target_intervals.list preview

chr21:9466976-9466977
chr21:9467731-9467732
chr21:9475550
chr21:9475796-9475802
chr21:9476537-9476549
chr21:9477336-9477337
chr21:9477436-9477439
chr21:9477710
chr21:9477945-9477956
chr21:9478355-9478399
...
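
A quick way to gauge how much local realignment lies ahead is to count the target intervals (the count depends on your data):

# count the realignment target intervals identified on chr21
wc -l < gatk_variants/target_intervals.list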

perform indel local realignment of the selected chromosome with gatk IndelRealigner

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=rg_mdup_fxmt_nmsrt_NA18507-mem.bam
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T IndelRealigner \
	-R ${reference} \
	-I ${infolder}/${infile} \
	-L chr21 \
	-targetIntervals ${infolder}/target_intervals.list \
	-known ${gold_indels} \
	-o ${infolder}/realigned_reads.bam \
	2>${infolder}/IndelRealigner.err

Analyze patterns of covariation in the sequence dataset with gatk BaseRecalibrator

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=realigned_reads.bam
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T BaseRecalibrator \
	-R ${reference} \
	-I ${infolder}/${infile} \
	-L chr21 \
	-knownSites ${dbsnp} \
	-knownSites ${gold_indels} \
	-o ${infolder}/recal_data.table \
	2>${infolder}/BaseRecalibrator.err

do a second pass to analyze covariation remaining after recalibration with gatk BaseRecalibrator

Use the same BAM data as in the first iteration

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=realigned_reads.bam
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T BaseRecalibrator \
	-R ${reference} \
	-I ${infolder}/${infile} \
	-L chr21 \
	-knownSites ${dbsnp} \
	-knownSites ${gold_indels} \
	-BQSR ${infolder}/recal_data.table \
	-o ${infolder}/post_recal_data.table \
	2>${infolder}/BaseRecalibrator_2.err

generate before/after plots with gatk AnalyzeCovariates

Handicon.png Requires install.packages("gsalib") in RStudio for the [R] part to work

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T AnalyzeCovariates \
	-R ${reference} \
	-L chr21 \
	-before ${infolder}/recal_data.table \
	-after ${infolder}/post_recal_data.table \
	-plots ${infolder}/recalibration_plots.pdf \
	2>${infolder}/AnalyzeCovariates.err

AnalyzeCovariates recalibration plots (recalibration_plots-0.png to recalibration_plots-4.png)

apply the recalibration to the sequence data with gatk PrintReads

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=realigned_reads.bam
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T PrintReads \
	-R ${reference} \
	-L chr21 \
	-I ${infolder}/${infile} \
	-BQSR ${infolder}/recal_data.table \
	-o ${infolder}/recal_reads.bam \
	2>${infolder}/PrintReads.err

Compress the sequence data with gatk ReduceReads

The BAM file obtained in the last step is about twice as big as the initial BAM. In this step, the size is brought back to a fraction of the initial size (about half). This compression may not be compatible with all GATK tools and older application workflows; please check with the Broad support team.

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=recal_reads.bam
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T ReduceReads \
	-R ${reference} \
	-I ${infolder}/${infile} \
	-L chr21 \
	-o ${infolder}/reduced_reads.bam \
	2>${infolder}/ReduceReads.err
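
The size claims above can be checked directly; a quick sketch using the files produced in this workflow:

# compare BAM sizes before and after ReduceReads compression
ls -lh gatk_variants/recal_reads.bam gatk_variants/reduced_reads.bam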

Call raw SNP and Indel variants with gatk HaplotypeCaller

For a diploid genome, gatk HaplotypeCaller is advised. We apply here the arbitrary confidence thresholds of 10 (emit) and 30 (call) to obtain good-quality variants.

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=reduced_reads.bam
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T HaplotypeCaller \
	-R ${reference} \
	-I ${infolder}/${infile} \
	-L chr21 \
	--genotyping_mode DISCOVERY \
	-stand_emit_conf 10 \
	-stand_call_conf 30 \
	-o ${infolder}/raw_variants.vcf \
	2>${infolder}/HaplotypeCaller.err

Raw variants contain a number of borderline calls and false positives. Removing these is an art and has long been based on empirical cutoff filtering. Recent advances in data analysis (Broad Institute) have led to a more statistical approach involving variant recalibration based on prior knowledge of known variants.

Recalibrate variant calls for SNPs and Indels

A number of Broad pages and documents are dedicated to variant recalibration; please read http://www.broadinstitute.org/gatk/guide/article?id=39 [4] and the included links. Please also view the great videos linked from that same page.

During this step, additional annotations are added to the data that allow later filtering and selection of subsets. When building the SNP recalibration model, annotations can be added from sources including:

  • BaseQualityRankSumTest (BaseQRankSum)
  • DepthOfCoverage (DP)
  • FisherStrand (FS)
  • HaplotypeScore (HaplotypeScore)
  • MappingQualityRankSumTest (MQRankSum)
  • MappingQualityZero (MQ0)
  • QualByDepth (QD)
  • ReadPositionRankSumTest (ReadPosRankSum)
  • RMSMappingQuality (MQ)
  • SnpEff: add genomic annotations using the third-party tool SnpEff with VariantAnnotator (a sketch follows this list)
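
For the SnpEff route, the sketch below is a hypothetical GATK 2.x VariantAnnotator call; it assumes snpEff 'eff' was first run on raw_variants.vcf to produce snpEff_output.vcf (a file name chosen here for illustration):

# annotate the raw calls with SnpEff predictions via VariantAnnotator
# (snpEff_output.vcf is an assumed snpEff 'eff' output for raw_variants.vcf)
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T VariantAnnotator \
	-R ${reference} \
	-A SnpEff \
	--variant gatk_variants/raw_variants.vcf \
	--snpEffFile gatk_variants/snpEff_output.vcf \
	-L gatk_variants/raw_variants.vcf \
	-o gatk_variants/annotated_variants.vcf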

build the SNP recalibration model with gatk VariantRecalibrator

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=raw_variants.vcf
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T VariantRecalibrator \
	-R ${reference} \
	-input ${infolder}/${infile} \
	-resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${hapmap} \
	-resource:omni,known=false,training=true,truth=false,prior=12.0 ${omni} \
	-resource:1000G,known=false,training=true,truth=false,prior=10.0 ${phase1snp} \
	-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${dbsnp} \
	-an DP \
	-an QD \
	-an FS \
	-an MQRankSum \
	-an ReadPosRankSum \
	-mode SNP \
	-tranche 100.0 \
	-tranche 99.9 \
	-tranche 99.0 \
	-tranche 90.0 \
	-numBad 1000 \
	-recalFile ${infolder}/recalibrate_SNP.recal \
	-tranchesFile ${infolder}/recalibrate_SNP.tranches \
	-rscriptFile ${infolder}/recalibrate_SNP_plots.R \
	2>${infolder}/VariantRecalibrator_snp.err

Plots are generated by R for SNP tranches

SNP tranche recalibration plots (recalibrate_SNP.tranches-0.png, recalibrate_SNP.tranches-1.png)

apply SNP recalibration

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=raw_variants.vcf
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T ApplyRecalibration \
	-R ${reference} \
	-input ${infolder}/${infile} \
	-mode SNP \
	--ts_filter_level 99.0 \
	-recalFile ${infolder}/recalibrate_SNP.recal \
	-tranchesFile ${infolder}/recalibrate_SNP.tranches \
	-o ${infolder}/recalibrated_snps_raw_indels.vcf \
	2>${infolder}/ApplyRecalibration_snp.err

Detailed plots are provided for each step of the recalibration. We only reproduce here page #1 of 10 similar panels

Recalibration plots (recalibrate_SNP_plots.R-0.png)

build the Indel recalibration model with gatk VariantRecalibrator

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=recalibrated_snps_raw_indels.vcf
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T VariantRecalibrator \
	-R ${reference} \
	-input ${infolder}/${infile} \
	-resource:mills,known=true,training=true,truth=true,prior=12.0 ${mills} \
	-an DP \
	-an FS \
	-an MQRankSum \
	-an ReadPosRankSum \
	-mode INDEL \
	-tranche 100.0 \
	-tranche 99.9 \
	-tranche 99.0 \
	-tranche 90.0 \
	-numBad 1000 \
	--maxGaussians 4 \
	-recalFile ${infolder}/recalibrate_INDEL.recal \
	-tranchesFile ${infolder}/recalibrate_INDEL.tranches \
	-rscriptFile ${infolder}/recalibrate_INDEL_plots.R \
	2>${infolder}/VariantRecalibrator_indels.err

Plots are generated by R for INDEL tranches

INDEL tranche recalibration plots (recalibrate_INDEL.tranches-0.png, recalibrate_INDEL.tranches-1.png)

apply INDEL recalibration

reference=${bundle}/ucsc.hg19.fasta
infolder=gatk_variants
infile=recalibrated_snps_raw_indels.vcf
 
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T ApplyRecalibration \
	-R ${reference} \
	-input ${infolder}/${infile} \
	-mode INDEL \
	--ts_filter_level 99.0 \
	-recalFile ${infolder}/recalibrate_INDEL.recal \
	-tranchesFile ${infolder}/recalibrate_INDEL.tranches \
	-o ${infolder}/recalibrated_variants.vcf \
	2>${infolder}/ApplyRecalibration_indels.err

Again, plots are generated by R

Recalibration plots (recalibrate_INDEL_plots.R-0.png)

The content of the file can be estimated based on the FILTER column

infolder=gatk_variants
infile=recalibrated_variants.vcf
 
grep -v "^#" ${infolder}/${infile} | cut -f 7 | sort | uniq -c
    992 LowQual
  78399 PASS
   1323 VQSRTrancheINDEL99.00to99.90
    355 VQSRTrancheINDEL99.90to100.00
   7541 VQSRTrancheSNP99.00to99.90
   3023 VQSRTrancheSNP99.90to100.00
 
# compress and index the vcf file
bgzip -c ${infolder}/${infile} > ${infolder}/${infile}.gz &&\
	tabix -p vcf ${infolder}/${infile}.gz

Filter high-quality data for comparison analysis using snpEff

The double-recalibrated file contains a FILTER field (column #7) that can be used to extract a high-quality subset.
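
A crude, stricter subset (PASS records only) can already be pulled out with standard text tools; the SnpSift filter applied further down also retains the VQSR tranche records. A minimal sketch, assuming the compressed VCF produced above (the output name is hypothetical):

infolder=gatk_variants
 
# keep the header plus records whose FILTER column (#7) equals PASS
zcat ${infolder}/recalibrated_variants.vcf.gz | \
	awk -F'\t' '$0 ~ /^#/ || $7 == "PASS"' | \
	bgzip -c > ${infolder}/pass_only_variants.vcf.gz && \
	tabix -p vcf ${infolder}/pass_only_variants.vcf.gz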

snpEff is a very powerful tool that can do much more than what is presented below. Please refer to its documentation to find out more.

overview of snpEff commands

snpEff version SnpEff 3.3h (build 2013-08-20), by Pablo Cingolani
Usage: snpEff [command] [options] [files]
 
Available commands: 
   [eff]           : Calculate effect of variants. Default: eff (no command or 'eff').
   build           : Build a SnpEff database.
   buildNextProt   : Build a SnpEff for NextProt (using NextProt's XML files).
   cds             : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
   closest         : Annotate the closest genomic region.
   count           : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval.
   databases       : Show currently available databases (from local config file).
   download        : Download a SnpEff database.
   dump            : Dump to STDOUT a SnpEff database (mostly used for debugging).
   genes2bed       : Create a bed file from a genes list.
   len             : Calculate total genomic length for each marker type.
   protein         : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
   spliceAnalysis  : Perform an analysis of splice sites. Experimental feature.
 
Run 'java -jar snpEff.jar command' for help on each specific command

overview of SnpSift commands

SnpSift version SnpSift 3.3h (build 2013-08-20), by Pablo Cingolani
Usage: java -jar SnpSift.jar [command] params...
Command is one of:
	alleleMat     : Create an allele matrix output.
	annotate      : Annotate 'ID' from a database (e.g. dbSnp). Assumes entries are sorted.
	annMem        : Annotate 'ID' from a database (e.g. dbSnp). Loads db in memory. Does not assume sorted entries.
	caseControl   : Compare how many variants are in 'case' and in 'control' groups; calculate p-values.
	ccs           : Case control summary. Case and control summaries by region, allele frequency and variant's functional effect.
	concordance   : Concordance metrics between two VCF files.
	covMat        : Create an covariance matrix output (allele matrix as input).
	dbnsfp        : Annotate with multiple entries from dbNSFP. <EXPERIMENTAL>
	extractFields : Extract fields from VCF file into tab separated format.
	filter        : Filter using arbitrary expressions
	geneSets      : Annotate using MSigDb gene sets (MSigDb includes: GO, KEGG, Reactome, BioCarta, etc.)
	gt            : Add Genotype to INFO fields and remove genotype fields when possible.
	gwasCat       : Annotate using GWAS catalog
	hwe           : Calculate Hardy-Weimberg parameters and perform a godness of fit test.
	intersect     : Intersect intervals (genomic regions).
	intervals     : Keep variants that intersect with intervals.
	intIdx        : Keep variants that intersect with intervals. Index-based method: Used for large VCF file and a few intervals to retrieve
	join          : Join files by genomic region.
	phastCons     : Annotate using conservation scores (phastCons).
	private       : Annotate if a variant is private to a family or group.
	rmRefGen      : Remove reference genotypes.
	rmInfo        : Remove INFO fields.
	sift          : Annotate using SIFT scores from a VCF file.
	split         : Split VCF by chromosome.
	tstv          : Calculate transiton to transversion ratio.
	varType       : Annotate variant type (SNP,MNP,INS,DEL or MIXED).
	vcf2tped      : Convert VCF to TPED.

infolder=gatk_variants
infile=recalibrated_variants.vcf
 
# equivalent to filter "( ! FILTER='LowQual' )" 
cat ${infolder}/${infile} | \
	java -Xmx6g -jar /opt/biotools/snpEff/SnpSift.jar \
	filter "( na FILTER ) | (FILTER = 'PASS' | FILTER =~ 'VQSRT')" \
	| bgzip -c > ${infolder}/hq_recalibrated_variants.vcf.gz && \
	tabix -p vcf ${infolder}/hq_recalibrated_variants.vcf.gz
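
A quick sanity check mirrors the FILTER count performed above; the LowQual class should now be absent while the PASS and VQSRTranche counts are unchanged:

infolder=gatk_variants
 
# recount the FILTER values in the filtered file
zcat ${infolder}/hq_recalibrated_variants.vcf.gz | \
	grep -v "^#" | cut -f 7 | sort | uniq -c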

Perform QC on the filtered VCF calls

Two tools can be used, one from VCFtools and one from HTSlib.

The older command vcf-stats will generate text files that can be consulted for a given metric.

command arguments

Usage: vcf-stats [OPTIONS] file.vcf.gz
Options:
   -d, --dump <file>                           Take an existing dump file and recreate the files (works with -p)
   -f, --filters <filter1,filter2>             List of filters such as column/field (any value), column/field=bin:max 
                                                (cluster in bins),column/field=value (exact value)
   -p, --prefix <dir/string>                   Prefix of output files. If slashes are present, directories will be created.
   -s, --samples <list>                        Process only the listed samples, - for none. 
                                                Excluding unwanted samples may increase performance considerably.
   -h, -?, --help                              This help message.
 
Examples:
   # Calculate stats separately for the filter field, quality and non-indels
   vcf-stats file.vcf.gz -f FILTER,QUAL=10:200,INFO/INDEL=False -p out/
 
   # Calculate stats for all samples
   vcf-stats file.vcf.gz -f FORMAT/DP=10:200 -p out/
 
   # Calculate stats only for the sample NA00001
   vcf-stats file.vcf.gz -f SAMPLE/NA00001/DP=1:200 -p out/
 
   vcf-stats file.vcf.gz > perl.dump

infolder=gatk_variants
 
infile=recalibrated_variants.vcf.gz
vcf-stats ${infolder}/${infile} -p ${infolder}/vcf_stats_${infile%%.vcf.gz}/
 
infile=hq_recalibrated_variants.vcf.gz
vcf-stats ${infolder}/${infile} -p ${infolder}/vcf_stats_${infile%%.vcf.gz}/

list of text files resulting from this command

-rw-rw-r-- 1 bits bits   58 Nov  8 15:35 vcf_stats_recalibrated_variants.vcf.gz.counts
-rw-rw-r-- 1 bits bits  20K Nov  8 15:35 vcf_stats_recalibrated_variants.vcf.gz.dump
-rw-rw-r-- 1 bits bits   58 Nov  8 15:35 vcf_stats_recalibrated_variants.vcf.gz.indels
-rw-rw-r-- 1 bits bits  316 Nov  8 15:35 vcf_stats_recalibrated_variants.vcf.gz.legend
-rw-rw-r-- 1 bits bits   47 Nov  8 15:35 vcf_stats_recalibrated_variants.vcf.gz.private
-rw-rw-r-- 1 bits bits   39 Nov  8 15:35 vcf_stats_recalibrated_variants.vcf.gz.qual-tstv
-rw-rw-r-- 1 bits bits   77 Nov  8 15:35 vcf_stats_recalibrated_variants.vcf.gz.samples-tstv
-rw-rw-r-- 1 bits bits   31 Nov  8 15:35 vcf_stats_recalibrated_variants.vcf.gz.shared
-rw-rw-r-- 1 bits bits   58 Nov  8 15:35 vcf_stats_recalibrated_variants.vcf.gz.snps
-rw-rw-r-- 1 bits bits   61 Nov  8 15:35 vcf_stats_recalibrated_variants.vcf.gz.tstv

The current version of HTSlib produces nicer output using htscmd vcfcheck and plot-vcfcheck, as already seen in the BITS training.

Technical.png The 'htscmd' package has been moved to 'bcftools_0.2'. The stat-generating command is now named 'bcftools stats' and the new plotting command is 'plot-vcfstats'.

infolder=gatk_variants
 
# apply htscmd vcfcheck to the full recalibrated data
infile=recalibrated_variants.vcf.gz
 
htscmd vcfcheck ${infolder}/${infile} > \
	  ${infolder}/${infile%%.vcf.gz}.check
 
plot-vcfcheck -p ${infolder}/${infile%%.vcf.gz}.plots/ \
	   ${infolder}/${infile%%.vcf.gz}.check
 
# apply htscmd vcfcheck to the HQ recalibrated data
infile=hq_recalibrated_variants.vcf.gz
 
htscmd vcfcheck ${infolder}/${infile} > \
	 ${infolder}/${infile%%.vcf.gz}.check
 
plot-vcfcheck -p ${infolder}/${infile%%.vcf.gz}.plots/ \
	 ${infolder}/${infile%%.vcf.gz}.check

With the new versions, this becomes:

infolder=gatk_variants
 
# apply bcftools stats to the full recalibrated data
infile=recalibrated_variants.vcf.gz
 
bcftools_0.2.0 stats ${infolder}/${infile} >\
	${infolder}/${infile%%.vcf.gz}.check
 
plot-vcfstats_0.2.0  -p ${infolder}/${infile%%.vcf.gz}.plots/ \
	${infolder}/${infile%%.vcf.gz}.check
 
# ...

vcfcheck_plots For all GATK calls
recalibrated_variants-0.png
recalibrated_variants-1.png
recalibrated_variants-2.png
vcfcheck_plots For the High Quality subset
hq_recalibrated_variants-0.png
hq_recalibrated_variants-1.png
hq_recalibrated_variants-2.png

The ts/tv ratio goes from 1.9 for all calls to 2.0 for the HQ subset; both values are indicative of good-quality calls.
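
The ratio can also be read directly from the .check files rather than from the plots. A sketch, assuming the tab-separated TSTV summary rows written by bcftools stats (and its htscmd ancestor), where column 5 holds ts/tv:

infolder=gatk_variants
 
# print the ts/tv column of the TSTV summary rows for both call sets
for f in recalibrated_variants hq_recalibrated_variants; do
	echo -n "${f}: "
	grep "^TSTV" ${infolder}/${f}.check | cut -f 5
done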

Estimate variant concordance between bcftools/samtools and gatk pipelines

The BWA mem mappings were used to call variants with the GATK Haplotype Caller (NGS_Exercise.5_GATK); these calls are compared here to the bcftools calls obtained above for the same BAM data.

concordance between bcftools and gatk calls on BWA mem

The full GATK list and the higher-confidence subset are compared to the filtered bcftools list of calls.

bcffile=bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz
gatkfile=gatk_variants/recalibrated_variants.vcf.gz
gatkhqfile=gatk_variants/hq_recalibrated_variants.vcf.gz
 
outfolder=vcf-venn/bcftools-gatk
mkdir -p ${outfolder}
 
# triple comparison
vcf-compare ${bcffile} \
	${gatkfile} \
	${gatkhqfile} | \
	grep "^VN" > ${outfolder}/compare-3_pos.cmp

comparison results

venn-3_cmp.png
VN  230    bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz  (0.3%)   gatk_variants/recalibrated_variants.vcf.gz     (0.3%)
VN  762    gatk_variants/recalibrated_variants.vcf.gz                (0.8%)
VN  3960   bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz  (4.7%)
VN  10032  gatk_variants/hq_recalibrated_variants.vcf.gz             (11.1%)  gatk_variants/recalibrated_variants.vcf.gz     (10.9%)
VN  80609  bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz  (95.1%)  gatk_variants/hq_recalibrated_variants.vcf.gz  (88.9%)  gatk_variants/recalibrated_variants.vcf.gz  (88.0%)

Handicon.png GATK calls 10032 additional variants as compared to the filtered bcftools list, which in turn shows 3960 private variants.
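
To inspect those private calls rather than just count them, the complement mode of vcf-isec (from the same VCFtools suite as vcf-compare) can be used. A sketch, reusing the variables defined above, with hypothetical output names:

outfolder=vcf-venn/bcftools-gatk
 
# positions present in the bcftools calls but absent from the GATK calls
vcf-isec -c ${bcffile} ${gatkfile} | \
	bgzip -c > ${outfolder}/bcftools_private.vcf.gz
 
# positions present in the GATK calls but absent from the bcftools calls
vcf-isec -c ${gatkfile} ${bcffile} | \
	bgzip -c > ${outfolder}/gatk_private.vcf.gz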

concordance between bcftools and gatk calls on BWA mem and CASAVA calls

This comparison can be related to measuring sensitivity (how many of the independently obtained CASAVA calls are recovered).

bcffile=bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz
gatkfile=gatk_variants/recalibrated_variants.vcf.gz
gatkhqfile=gatk_variants/hq_recalibrated_variants.vcf.gz
casava=public_info/Illumina_BaseSpace/chr21_SNPs_NA18507_CASAVA-1.8_hg19.recode.vcf.gz
 
outfolder=vcf-venn/bcftools-gatk
mkdir -p ${outfolder}
 
# 4-way comparison
vcf-compare ${bcffile} \
	${gatkfile} \
	${gatkhqfile} \
	${casava} |
	grep "^VN" > ${outfolder}/compare-4casava_pos.cmp

comparison results

venn-4casava_cmp.png
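
The VN counts can be turned into a rough sensitivity figure with a little awk. A sketch, assuming each VN category sits on a single line of the .cmp file (count in column 2) and matching categories by the CASAVA file name:

cmp=vcf-venn/bcftools-gatk/compare-4casava_pos.cmp
 
# total CASAVA positions = sum over all categories containing the CASAVA file
total=$(awk '/CASAVA/ {sum+=$2} END {print sum}' ${cmp})
# shared = categories containing both the CASAVA and the GATK files
shared=$(awk '/CASAVA/ && /recalibrated_variants/ {sum+=$2} END {print sum}' ${cmp})
echo "scale=3; ${shared}/${total}" | bc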

concordance between bcftools and gatk calls on BWA mem and the HapMap 3.3 gold standard

This comparison can be related to measuring specificity (how many of the calls are confirmed by the HapMap gold standard).

Handicon.png The HapMap set was filtered to keep only calls consistent with the parental haplotypes and is limited to SNPs.

bcffile=bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz
gatkfile=gatk_variants/recalibrated_variants.vcf.gz
gatkhqfile=gatk_variants/hq_recalibrated_variants.vcf.gz
goldlist=public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz
 
outfolder=vcf-venn/bcftools-gatk
mkdir -p ${outfolder}
 
# 4-way comparison
vcf-compare ${bcffile} \
	${gatkfile} \
	${gatkhqfile} \
	${goldlist} |
	grep "^VN" > ${outfolder}/compare-4gold_pos.cmp

comparison results

venn-4gold_cmp.png
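
The same awk recipe shown for the CASAVA comparison can be reused here by pointing it at compare-4gold_pos.cmp and matching categories on the HapMap file name, giving a rough estimate of the fraction of calls confirmed by the gold standard.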

download exercise files

Download exercise files here

Use the appropriate application to open the files present in ex5b-files.

References:
  1. http://www.broadinstitute.org/gatk/guide/
  2. http://www.youtube.com/playlist?list=PLlMMtlgw6qNjSDAnDIDurM5g_C73hK4ya
  3. http://gatkforums.broadinstitute.org/discussion/2909/howto-fix-a-badly-formatted-bam
  4. http://www.broadinstitute.org/gatk/guide/article?id=39
