NGS-Var2018 Exercise.4

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2018 | NGS-Var2018 Exercise.3 | NGS-Var2018 Exercise.5 ]


Call variants against the human reference genome (hg38) with GATK.GermlineSNPsAndIndelsCaller and produce genotypes with GATK.GermlineGenotyper


ex04_wf.png

Handicon.png A separate wikibook page can be consulted about variant calling from NGS: <http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/DNA_Variants>

Several methods are available to convert read mapping information to variant calls. They are all based on local pileup alignment to the reference genome and consensus extraction but different callers use different strategies to limit the effect of wrongly mapped or low quality read sequences on the final result. We used here GATK4 to call SNVs and small indels. Previous training sessions made use of Varscan and bcftools to align and call variants. This year, we switched to the generally accepted gold standard caller developed at the Broad institute. Please refer to other Wiki pages for information on how to use the other callers.

Preliminary info

quick intro on VCF format

Since the expansion of the 1000 genome project[1], the Variant Call Format has become more and more popular and is today the default format to represent sequence variation. VCF is a tabular text format that provides rich information about each position different from the reference genome. It also includes different scores obtained during sequencing, alignment, and calling to allow quality filtering as well as added sequence annotations to allow annotation-driven filtering. Various tools have been developed to manipulate VCF data and are exemplified during this session.

An example of VCF data is provided here as a primer, users will get more information in NGS-formats or from the VCF documentation <http://en.wikipedia.org/wiki/Variant_Call_Format>[2].

Handicon.png As today several VCF subversion are used, please refer to the best documentation for your own application from this page

Handicon.png Currently, VCF version 4.2 is the latest build, read more about its secrets here


vcf_fig1.png
vcf_fig2.png

About analysing variants in multiple 'related' genomes

When you want to analyse variant in more than one genome (family structure of patient cohort) and compare the genomes with each other, you should ideally call the variants in a single job in order to make the difference between no-calls (absence of read support information) and REF-calls (the genome is not mutated at this position as supported by reads).

This becomes rapidly impossible when you deal with cohorts of patients/controls and another solution had to be found in order to combines studies or call from multiple genomes. The solution is to produce an intermediate format called gVCF which not only stores the variant calls but also REF calls and No-Calls. Each sample gets the full information stored in the gVCF files, which can be combined in a second step (genotyping) after processing all relevant genomes individually.

Handicon.png Broad page dedicated to explaining gVCF: https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf


gvcf_fig1.png

Call variants with GATK.GermlineSNPsAndIndelsCaller

  • start the module and link the latest BAM file and the hg38 reference
ex4_01.png
ex4_02.png
  • wait for the results
ex4_03.png

 

Convert the gVCF data to conventional VCF format with GATK.GermlineGenotyper

Handicon.png Because we use a 1-10% sample from a single chromosome, the VQSR recalibration is not applicable to our data. Only when the input is consequent, VQSR can be applied and produce improved call quality scores

  • start the module and link the latest BAM file and the hg38 reference
ex4_04.png
ex4_05.png
  • leave the variant set for recalibration on 'none' and use instead quality filtering from the next field with default cutoffs
 QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SQR > 3.0 || QUAL < 30
  • wait for the results
ex4_06.png

 

VQSR - Recalibration reports

Handicon.png When available, the VQSR reports can be accessed from the PDF file produced by the run

VQSR results from the full run

ex4_07.png

Handicon.png VQSR gaussian results are 'explained' HERE

 

Handicon.png Our variants are ready as far as GATK is concerned and should now be gene-annotated and user-filtered to identify candidate disease/phenotype-related candidates.

Compress and index the VCF data

Downstream tools require the VCF data to be sorted by coordinate, compressed, and indexed so that random search can be operate very fast based on coordinate queries.

  • We produce this compressed and indexed version of the VCF data with the module BgzipAndTabixindex as shown next
ex4_08.png
ex4_09.png

download exercise files

Download exercise files here

Use the right application to open the files present in ex4-files

References:
  1. http://www.1000genomes.org
  2. http://en.wikipedia.org/wiki/Variant_Call_Format

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2018 | NGS-Var2018 Exercise.3 | NGS-Var2018 Exercise.5 ]