NGS-Var2020 Exercise.4
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.3 | NGS-Var2020 Exercise.5 ]
Call variants against the human reference genome (hg38) with GATK.GermlineSNPsAndIndelsCaller and produce genotypes with GATK.GermlineGenotyper
A separate wikibook page can be consulted about variant calling from NGS: <http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/DNA_Variants>
Several methods are available to convert read mapping information to variant calls. They are all based on local pileup alignment to the reference genome and consensus extraction but different callers use different strategies to limit the effect of wrongly mapped or low quality read sequences on the final result. We used here GATK4 to call SNVs and small indels. Previous training sessions made use of Varscan and bcftools to align and call variants. This year, we switched to the generally accepted gold standard caller developed at the Broad institute. Please refer to other Wiki pages for information on how to use the other callers.
Contents
Preliminary info
quick intro on VCF format
Since the expansion of the 1000 genome project[1], the Variant Call Format has become more and more popular and is today the default format to represent sequence variation. VCF is a tabular text format that provides rich information about each position different from the reference genome. It also includes different scores obtained during sequencing, alignment, and calling to allow quality filtering as well as added sequence annotations to allow annotation-driven filtering. Various tools have been developed to manipulate VCF data and are exemplified during this session.
An example of VCF data is provided here as a primer, users will get more information in NGS-formats or from the VCF documentation <http://en.wikipedia.org/wiki/Variant_Call_Format>[2].
As today several VCF subversion are used, please refer to the best documentation for your own application from this page
Currently, VCF version 4.2 is the latest build, read more about its secrets here
When you want to analyse variant in more than one genome (family structure of patient cohort) and compare the genomes with each other, you should ideally call the variants in a single job in order to make the difference between no-calls (absence of read support information) and REF-calls (the genome is not mutated at this position as supported by reads).
This becomes rapidly impossible when you deal with cohorts of patients/controls and another solution had to be found in order to combines studies or call from multiple genomes. The solution is to produce an intermediate format called gVCF which not only stores the variant calls but also REF calls and No-Calls. Each sample gets the full information stored in the gVCF files, which can be combined in a second step (genotyping) after processing all relevant genomes individually.
Broad page dedicated to explaining gVCF: https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf
Call variants with GATK.GermlineSNPsAndIndelsCaller
Inorder to focus on chr22 only we provide an extra file with all chr22 sequences from the reference or only the assembled chr22
These files can be written in GATK list format or in BED format (tab-separated) as here
- example of a list format for a sub-region of chr22 (chr22_15M_20M.list)
chr22:15000000-20000000
- All chr22 sequences from the HG38 reference in BED format (chr22_all.bed)
chr22 0 50818468 chr22_KI270731v1_random 0 150754 chr22_KI270732v1_random 0 41543 chr22_KI270733v1_random 0 179772 chr22_KI270734v1_random 0 165050 chr22_KI270735v1_random 0 42811 chr22_KI270736v1_random 0 181920 chr22_KI270737v1_random 0 103838 chr22_KI270738v1_random 0 99375 chr22_KI270739v1_random 0 73985 chr22_KI270875v1_alt 0 259914 chr22_KI270878v1_alt 0 186262 chr22_KI270879v1_alt 0 304135 chr22_KI270876v1_alt 0 263666 chr22_KI270877v1_alt 0 101331 chr22_GL383583v2_alt 0 96924 chr22_GL383582v2_alt 0 162811 chr22_KB663609v1_alt 0 74013 chr22_KI270928v1_alt 0 176103
- only the assembled chr22 from HG38 in BED format (chr22_only.bed)
chr22 0 50818468
- start the module and link the latest BAM file and the hg38 reference
- wait for the results
Convert the gVCF data to conventional VCF format with GATK.GermlineGenotyper
Because we use a 1-10% sample from a single chromosome, the VQSR recalibration is not applicable to our data. Only when the input is consequent, VQSR can be applied and produce improved call quality scores
If you add known variants here they will be tagged in the 3rd column of the VCF data. If you do not do it now, no problem, you can add them later with SnpSift
- start the module and link the latest BAM file and the hg38 reference
- leave the variant set for recalibration on 'none' and use instead quality filtering from the next field with default cutoffs
QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SQR > 3.0 || QUAL < 30
- wait for the results
VQSR - Recalibration reports from the full data Run
VQSR is a very important step to obtain high quality variant calls. For the adventurous doing this at command-line, please read more on this GATK page
VQSR results from the full run
VQSR gaussian results are 'explained' HERE
- GATK_variants.SNPs_annotation.R.pdf
- GATK_variants.INDELs_annotation.R.pdf
- GATK_variants.SNPs.tranches.pdf
Our variants are ready as far as GATK is concerned and should now be gene-annotated and user-filtered to identify candidate disease/phenotype-related candidates.
Compress and index the VCF data
Downstream tools require the VCF data to be sorted by coordinate, compressed, and indexed so that random search can be operate very fast based on coordinate queries.
- We produce this compressed and indexed version of the VCF data with the module BgzipAndTabixindex as shown next
download exercise files
Download exercise files here
Use the right application to open the files present in 100pc_ex4-files
References:
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.3 | NGS-Var2020 Exercise.5 ]