NGS Exercise.7
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS Exercise.7_SnpEff | NGS_Exercise.7_vcfCodingSnps | NGS Exercise.7_annovar | NGS Exercise.8 ]
# updated 2014 version
Add annotations to Variant Call Format (VCF) files
Introduction
Variant lists are important but often long and hard to evaluate. To rank candidate variants for validation, we need to know where these variants occur and what effect they may have: on gene regulation when they lie close to or within a gene region, or on the protein product when they fall in exons.
Whatever software you apply to clean your variant calls, you will still need to validate what you get from NGS :o)
Several tools exist to annotate variant lists and predict variant effects; we present several popular ones here.
WEB based tools that will expose your data to the internet
- the recent EnsEMBL Variant Effect Predictor tool (VEP:http://www.ensembl.org/info/docs/variation/vep/index.html[1])
- the more recent, graphical UCSC Variant Annotation Integrator (VAI:http://genome.ucsc.edu/cgi-bin/hgVai[2])
- SeattleSeq as another web tool.
Command-line tools running on your local computer
- Annovar (http://www.openbioinformatics.org/annovar/[3], publication[4])
- vcfCodingSnps <http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps>[5]
- SnpEff and SnpSift <http://snpeff.sourceforge.net>[6]
Other 'not tested' variant annotation tools exist
- variant tools <http://varianttools.sourceforge.net/Annotation/HomePage>[7]
- VAT <http://vat.gersteinlab.org/download.php>[8]
- VariantAnnotation (a Bioconductor package) <http://www.bioconductor.org/packages/2.12/bioc/html/VariantAnnotation.html>[9]
- And many more that you can find with a web search
Web-based variant annotation tools
Some will prefer a quick and easy analysis platform. For users who are short on time, web alternatives to Annovar exist and are briefly presented here. Before jumping onto these, keep the following warning in mind:
Submitting your data on a web page means exposing it to the public and may violate patient confidentiality and/or compromise the patentability of your findings at a later stage. Discuss this with your supervisor before doing it.
EnsEMBL VEP quick overview
Developed by the EnsEMBL team, this tool is also written in Perl and queries the large EnsEMBL database to collect annotations and up-to-date genome information. Both web and standalone versions are available (info:<http://www.ensembl.org/info/docs/variation/vep/index.html#web>, web:<http://www.ensembl.org/tools.html>). Note that the web interface is limited to a few hundred variants (750) and that posting your variants on the Internet may violate confidentiality terms. Results are in VEP format, which can be read and filtered in your favorite spreadsheet application (alas!) or, better, in Google Refine if you have a huge file to work on.
the VEP submission page
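To stay under the web limit quoted above, a large VCF can be cut into smaller files that each repeat the header. The following is a minimal Python sketch, not part of the original exercise: the function name `split_vcf_lines` is hypothetical and the default of 750 simply reuses the figure given in the text. The confidentiality warning still applies to every chunk you upload.

```python
def split_vcf_lines(lines, max_variants=750):
    """Yield lists of VCF lines; each list repeats the full header
    and carries at most max_variants data lines.

    Assumes a well-formed VCF where all '#' header lines come first.
    """
    header = []
    chunk = []
    for line in lines:
        if line.startswith("#"):
            header.append(line)      # meta lines and the #CHROM column line
            continue
        if not line.strip():
            continue                 # ignore blank lines
        chunk.append(line)
        if len(chunk) == max_variants:
            yield header + chunk
            chunk = []
    if chunk:                        # last, possibly shorter, chunk
        yield header + chunk


if __name__ == "__main__":
    # 'variants.vcf' is a placeholder file name.
    with open("variants.vcf") as handle:
        for index, part in enumerate(split_vcf_lines(handle), start=1):
            with open("variants.part%03d.vcf" % index, "w") as out:
                out.writelines(part)
```

Each output file is a valid stand-alone VCF as long as the input was one, so the parts can be submitted one by one and the results concatenated afterwards.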
UCSC VAI: a starter
Uploading the variant list requires reformatting the list in vcf format and selecting the matching reference genome on the submission page.
the VAI submission page
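If your variants sit in a simple table rather than a VCF, the reformatting step can be sketched as below. This is an illustration only: the function name `to_minimal_vcf` is hypothetical, the input is assumed to be (chromosome, position, reference allele, alternate allele) tuples, and whether a given web tool accepts such a bare-bones VCF (no QUAL, FILTER or INFO values) should be checked on its submission page.

```python
def to_minimal_vcf(variants):
    """Build a minimal VCF text from (chrom, pos, ref, alt) tuples.

    Only the eight mandatory columns are written; unknown values are '.'.
    Records are sorted by chromosome name, then by numeric position.
    """
    lines = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    ordered = sorted(variants, key=lambda v: (str(v[0]), int(v[1])))
    for chrom, pos, ref, alt in ordered:
        fields = [str(chrom), str(int(pos)), ".", ref, alt, ".", ".", "."]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two chr21 positions taken from the SeattleSeq example on this page.
    print(to_minimal_vcf([("21", 9467417, "A", "C"),
                          ("21", 9467416, "C", "T")]), end="")
```

Remember to select the reference genome build that matches the coordinates you put in the POS column.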
Most if not all annotation types shown in the screenshot above are accessible in Annovar. Typical annotations created by VAI are as follows
the VAI annotation types
Results are formatted in VEP format, similar to the output of the equivalent EnsEMBL tool.
The major drawbacks of web annotators are twofold: i) the lack of confidentiality when submitting your variants to the web, and ii) the size limit of a few hundred lines when submitting data to the web tools (except for SeattleSeq, which accepted our 80'000 rows). For these reasons, the standalone command-line tools described later on this page are preferred by most advanced users.
SeattleSeq: a starter
The server can be found at http://snp.gs.washington.edu/ [10]. A VCF variant list can be uploaded; the analysis is queued and an e-mail is sent when the results are ready.
The SeattleSeq Annotation server provides annotation of SNVs (single-nucleotide variations) and small indels, both known and novel. This annotation includes dbSNP rs ID, gene names and accession numbers, variation functions (e.g. missense), protein positions and amino-acid changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association. Links to other annotation sites are also provided.
the SeattleSeq submission page
top five rows of results
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GAIIx-chr21-BWA.mem
21 9467416 rs369392604 C T 86 . \
    SF=0,1;DN=138;DA=C/T;GM=none;\
    FG=intergenic;FD=unknown;CP=0.170;AA=C;DSP=440077 GT:PL:GQ 0/1:116,0,126:99
21 9467417 rs372306150 A C 111 . \
    SF=0,1,2;DN=138;DA=A/C;GM=none;\
    FG=intergenic;FD=unknown;CP=0.155;AA=A;DSP=440076 GT:PL:GQ 0/1:141,0,95:98
21 9471670 . A G 37.8 . \
    SF=0,1,2;GM=none;FG=intergenic;FD=unknown;\
    CP=0.000;CG=0.202;AA=G;DSP=435823 GT:PL:GQ 1/1:69,6,0:10
21 9472931 . T G 52 . \
    SF=0,1,2;GM=none;FG=intergenic;FD=unknown;\
    CP=0.002;CG=-0.931;AA=T;RM=MLT1D;DSP=434562 GT:PL:GQ 1/1:84,9,0:16
21 9473159 rs74477762 A G 52 . \
    SF=0,1,2;DN=131;DA=A/G;GM=none;\
    FG=intergenic;FD=unknown;CP=0.000;CG=-1.430;AA=G;DG;DV=by-frequency,by-cluster;\
    DSP=434334 GT:PL:GQ 1/1:84,9,0:16
variants with stop codons ('TRP/stop' in C21orf33; 'stop/CYS' in TSPEAR/KRTAP10-4)
21 45557227 rs74418161 G A 216 . SF=0,1,2;DN=132;DA=G/A;\
    GM=NM_004649.6,NM_198155.3,XM_005261183.1,XM_005261184.1,XM_005261185.1,\
    XM_005261186.1,XM_005261187.1;GL=C21orf33;FG=intron,intron,stop-gained,intron,intron,intron,intron;\
    FD=intron-variant,intron-variant,unknown,unknown,unknown,unknown,unknown;AAC=none,none,\
    TRP/stop,none,none,none,none;PP=NA,NA,159/296,NA,NA,NA,NA;CDP=NA,NA,477,NA,NA,NA,NA;\
    CP=0.000;CG=-3.700;AA=G;DG;DV=by-frequency,by-cluster,by-1000G;DSP=33;GESP=A:518/G:12488;\
    PAC=NA,NA,XP_005261240.1,NA,NA,NA,NA GT:PL:GQ 0/1:246,0,252:99
21 45994841 rs7276273 A C 124 . SF=0,1,2;DN=116;DA=A/C;\
    GM=NM_001272037.1,NM_144991.2,NM_198687.1,XM_005261158.1;\
    GL=TSPEAR/KRTAP10-4;FG=intron,intron,stop-lost,intron;\
    FD=intron-variant,intron-variant,stop-lost,unknown;AAC=none,none,stop/CYS,none;\
    PP=NA,NA,402/402,NA;CDP=NA,NA,1206,NA;CP=0.962;CG=4.380;AA=C;DG;\
    DV=by-frequency,by-cluster,by-2hit-2allele,by-1000G;DSP=1185;GESP=C:970/A:11952;\
    PAC=NA,NA,NP_941960.1,NA GT:PL:GQ 0/1:154,0,134:99
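Rows like the two stop-codon variants above can be pulled out of a SeattleSeq-annotated VCF with a few lines of Python. The sketch below is only an illustration: the helper names `parse_info` and `has_stop_change` are hypothetical, and it assumes, from the rows shown here, that the FG tag holds one functional class per transcript listed in GM. The backslashes in the listing are display line-wraps; in the real file the INFO column is a single string.

```python
STOP_CLASSES = {"stop-gained", "stop-lost"}


def parse_info(info):
    """Turn 'K1=V1;FLAG;K2=V2' into a dict; value-less flags map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def has_stop_change(info):
    """True if any comma-separated FG class is stop-gained or stop-lost."""
    classes = parse_info(info).get("FG", "")
    return any(c in STOP_CLASSES for c in str(classes).split(","))


if __name__ == "__main__":
    # 'annotated.vcf' is a placeholder file name.
    with open("annotated.vcf") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) > 7 and has_stop_change(cols[7]):
                print(line, end="")
```

The same pattern works for any other class you want to keep; only the `STOP_CLASSES` set needs to change.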
Annovar - not perfect but still very performant
Annovar is the historical tool for annotating large lists of variants. It was designed in the early days of the VCF format and, instead of adopting it, went its own way with its own tabular format. This makes Annovar less convenient today, as most other tools accept VCF and/or BED, which are not native Annovar formats.
Please refer to the separate NGS_Exercise.7_annovar page for a quick start and to the full online documentation.
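To give a feel for the format gap, the sketch below turns single-base substitutions from a VCF line into the five leading columns of Annovar's tabular input (chromosome, start, end, reference, alternate). It is a simplified illustration, not a replacement for the conversion script shipped with Annovar (convert2annovar.pl): the function name `vcf_snv_to_annovar` is hypothetical, and indels and multi-allelic sites are deliberately skipped because their coordinates need extra handling.

```python
BASES = set("ACGTN")


def vcf_snv_to_annovar(vcf_line):
    """Return 'chr<TAB>start<TAB>end<TAB>ref<TAB>alt' for a simple SNV.

    Returns None for header lines, indels, multi-allelic records
    and anything else that is not a one-base substitution.
    """
    if vcf_line.startswith("#"):
        return None
    cols = vcf_line.rstrip("\n").split("\t")
    if len(cols) < 5:
        return None
    chrom, pos, ref, alt = cols[0], cols[1], cols[3], cols[4]
    if ref.upper() not in BASES or alt.upper() not in BASES:
        return None          # skips indels, 'A,C' style ALT lists and '.'
    # For a single-base substitution, start and end are the same position.
    return "\t".join([chrom, pos, pos, ref, alt])


if __name__ == "__main__":
    row = "21\t9467416\trs369392604\tC\tT\t86\t.\t.\n"
    print(vcf_snv_to_annovar(row))
```

For real work, use the converter that comes with Annovar, as it also handles indels and genotype columns.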
VCF inline annotation tools
Two tools are described and briefly illustrated in the following pages. Both ADD annotations to the VCF data instead of creating non-VCF annotated tables as Annovar does. Inline annotations are powerful because they complement the variant descriptions present in the VCF while keeping all other VCF annotations, BUT the resulting file is hard for a human to read and requires post-processing and filtering to become useful.
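One simple form of post-processing is to spread a few chosen INFO tags over their own columns so that the result opens cleanly in a spreadsheet. The sketch below assumes a tab-delimited VCF with the INFO field in the eighth column; the function name `info_to_table` is hypothetical, and the tag names in the usage example (FG, GL, AAC) are taken from the SeattleSeq output shown earlier, so other annotators will need other keys.

```python
def info_to_table(vcf_lines, keys):
    """Return tab-delimited text: CHROM, POS, REF, ALT plus one column
    per requested INFO key (empty when the key is absent, 'yes' for flags).
    """
    keys = list(keys)
    rows = ["\t".join(["CHROM", "POS", "REF", "ALT"] + keys)]
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue                       # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        info = {}
        for item in cols[7].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else "yes"
        rows.append("\t".join([cols[0], cols[1], cols[3], cols[4]]
                              + [info.get(k, "") for k in keys]))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # 'annotated.vcf' is a placeholder file name.
    with open("annotated.vcf") as handle:
        print(info_to_table(handle, ["FG", "GL", "AAC"]), end="")
```

The annotated VCF itself stays untouched, so the table can be regenerated with a different set of keys at any time.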
vcfCodingSnps
This software was released a few years ago but is still valid. We present it briefly on the page NGS_Exercise.7_vcfCodingSnps.
SnpEff & SnpSift
Last but not least, SnpEff is the rising star for VCF annotation and filtering. It is a very powerful toolset co-developed with the Broad Institute and will likely become the standard, as GATK already is for variant calling. Please refer to the separate NGS_Exercise.7_SnpEff page for a quick start and to the full online documentation.
download exercise files
Download exercise files here
References:
- http://www.ensembl.org/info/docs/variation/vep/index.html
- http://genome.ucsc.edu/cgi-bin/hgVai
- http://www.openbioinformatics.org/annovar/
- Kai Wang, Mingyao Li, Hakon Hakonarson. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res: 2010, 38(16);e164. [PubMed:20601685]
- http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps
- http://snpeff.sourceforge.net
- http://varianttools.sourceforge.net/Annotation/HomePage
- http://vat.gersteinlab.org/download.php
- http://www.bioconductor.org/packages/2.12/bioc/html/VariantAnnotation.html
- http://snp.gs.washington.edu/SeattleSeqAnnotation138/index.jsp