NGS Exercise.7 SnpEff
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS_Exercise.7_vcfCodingSnps | NGS_Exercise.7_annovar | NGS Exercise.8 ]
# updated 2014 version
Add annotations to variant calling format with SnpEff & SnpSift
Contents
Introduction
The very powerful and complete toolbox provided with SnpEff and SnpSift will likely be part of future GATK versions. Code below is shown as appetizer. You can install the application on your own machine quite easily using the information provided in the home site and experience with the many possibilities of the software.
Overview of SnpEff and SnpSift commands
SnpEff & SnpSift [1][2] were developed by Pablo Cingolani after vcfCodingSnps (Yanming Li, Goncalo Abecasis)[3] to directly annotate VCF data and filter calls by many different ways. Both programs combine the richness of Annovar[4] annotations and the advantage of manipulating the VCF data directly and without changing format. This session only provides a starter to snpEff. Please refer to the SnpEff manual pages[5] and SnpSift manual pages[6] for more information.
- Read SnpEff usage in the full GATK GuideBook and how SnpEff annotations can be added to GATK VCF data using the GATK VariantAnnotator tool (regularly check the GATK pages for more recent versions of these documents).
SnpEff annotates and predicts the effects of variants on genes (such as amino acid changes).
• snpEff commands
snpEff version SnpEff 3.6b (build 2014-05-01), by Pablo Cingolani Usage: snpEff [command] [options] [files] Run 'java -jar snpEff.jar command' for help on each specific command Available commands: [eff] : Calculate effect of variants. Default: eff (no command or 'eff'). build : Build a SnpEff database. buildNextProt : Build a SnpEff for NextProt (using NextProt's XML files). cds : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness. closest : Annotate the closest genomic region. count : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval. databases : Show currently available databases (from local config file). download : Download a SnpEff database. dump : Dump to STDOUT a SnpEff database (mostly used for debugging). genes2bed : Create a bed file from a genes list. len : Calculate total genomic length for each marker type. protein : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness. spliceAnalysis : Perform an analysis of splice sites. Experimental feature. Generic options: -c , -config : Specify config file -d , -debug : Debug mode (very verbose). -dataDir <path> : Override data_dir parameter from config file. -h , -help : Show this help and exit -if , -inOffset : Offset input by a number of bases. E.g. '-inOffset 1' for one-based TXT input files -of , -outOffset : Offset output by a number of bases. E.g. '-outOffset 1' for one-based TXT output files -noLog : Do not report usage statistics to server -t : Use multiple threads (implies '-noStats'). Default 'off' -q , -quiet : Quiet mode (do not show any messages or errors) -v , -verbose : Verbose mode Database options: -canon : Only use canonical transcripts. -interval : Use a custom intervals in TXT/BED/BigBed/VCF/GFF file (you may use this option many times) -motif : Annotate using motifs (requires Motif database). -nextProt : Annotate using NextProt (requires NextProt database). -reg <name> : Regulation track to use (this option can be used add several times). -onlyReg : Only use regulation tracks. -onlyTr <file.txt> : Only use the transcripts in this file. Format: One transcript ID per line. -ss , -spliceSiteSize <int> : Set size for splice sites (donor and acceptor) in bases. Default: 2 -ud , -upDownStreamLen <int> : Set upstream downstream interval length (in bases)
SnpSift is a toolbox that allows you to filter and manipulate annotated files
• SnpSift commands
SnpSift version SnpSift 3.6b (build 2014-05-01), by Pablo Cingolani Usage: java -jar SnpSift.jar [command] params... Command is one of: alleleMat : Create an allele matrix output. annotate : Annotate 'ID' from a database (e.g. dbSnp). Assumes entries are sorted. annMem : Annotate 'ID' from a database (e.g. dbSnp). Loads db in memory. Does not assume sorted entries. caseControl : Compare how many variants are in 'case' and in 'control' groups; calculate p-values. ccs : Case control summary. Case and control summaries by region, allele frequency and variant's functional effect. concordance : Concordance metrics between two VCF files. covMat : Create an covariance matrix output (allele matrix as input). dbnsfp : Annotate with multiple entries from dbNSFP. <EXPERIMENTAL> epistasis : Epistatic case control model (this feature is alpha). extractFields : Extract fields from VCF file into tab separated format. filter : Filter using arbitrary expressions geneSets : Annotate using MSigDb gene sets (MSigDb includes: GO, KEGG, Reactome, BioCarta, etc.) gt : Add Genotype to INFO fields and remove genotype fields when possible. gwasCat : Annotate using GWAS catalog hwe : Calculate Hardy-Weimberg parameters and perform a godness of fit test. intersect : Intersect intervals (genomic regions). intervals : Keep variants that intersect with intervals. intIdx : Keep variants that intersect with intervals. Index-based method: Used for large VCF file and a few intervals to retrieve join : Join files by genomic region. phastCons : Annotate using conservation scores (phastCons). private : Annotate if a variant is private to a family or group. rmRefGen : Remove reference genotypes. rmInfo : Remove INFO fields. sift : Annotate using SIFT scores from a VCF file. split : Split VCF by chromosome. tstv : Calculate transiton to transversion ratio. varType : Annotate variant type (SNP,MNP,INS,DEL or MIXED). vcf2tped : Convert VCF to TPED.
download and install SnpEff and SnpSift on your local machine
No need to run this code, the program is ready to be used. Use these commands back in the lab to install your own version and get used to the command syntax.
# download and install the software according to the instructions # http://snpeff.sourceforge.net/download.html # edit the snpEff.config to point to the correct place for the data to be saved # default is './data' in the program folder # create a variable to point to the SnpEff installation folder in your .profile (or .bashrc) ## export SNPEFF=/path/to/snpeff # this allows calling snpeff.jar directly from anywhere with Java -Xmx4g $SNPEFF/snpEff.jar -c $SNPEFF/snpEff.config ... # or if you have saved the reference genome file elsewhere java -Xmx4g -jar $SNPEFF/snpEff.jar -dataDir <path> ... # first get current version of the software and available command list java -jar $SNPEFF/snpEff.jar # then get the full list of human genomes to identify the correct human version to install java -jar $SNPEFF/snpEff.jar databases | grep "Homo_sapiens" # select one and download it (here the most recent build37) java -jar $SNPEFF/snpEff.jar download -v GRCh37.75; # (37.75 at the time of this edit) favoured reference with full annotations java -jar $SNPEFF/snpEff.jar download -v hg19; # limited annotations ## after running the last two lines, zip files will be left behind in the current folder. ## You can safely delete these zip files if the installation reported a success.
Add human annotations to the demo 1kg VCF file
It is obvious that this example is just a minimal run just to show the power of the tools and generate data that can be loaded in IGV.
The SnpEff / SnpDiff documentation pages cited above are well made and can easily be read by biologists interested in such analysis.
SnpEff and SnpSift have been chosen by the Broad Institute in complement to the GATK, this is revealing of the quality of these packages
annotate the demo file with GRCh37 data
A test file is provided with the package, we use it here to annotate variants with the most recent human B37 version
# this example demo command annotated a demo file provided with the software # we use UCSC style chromosome naming as ##reference=GRCh37 is found in the input file java -jar $SNPEFF/snpEff.jar \ GRCh37.75 \ $SNPEFF/demo.1kg.vcf \ > $SNPEFF/gwas-annotated_demo.1kg.vcf # some warning are issued during the process WARNINGS: Some warning were detected Warning type Number of warnings WARNING_TRANSCRIPT_INCOMPLETE 6134 WARNING_TRANSCRIPT_NO_START_CODON 1240
Inspect the 1kg results and isolate interesting calls
expand on the right to see the top results
# inspect the first two lines of input with some reformatting to fit the window size grep -v "^#" demo.1kg.vcf | head -2 | tr "\t" "\ " | sed -e "s/.\{75\}/&\n/g" 1 10583 rs58108140 G A 100 PASS AVGPOST=0.7707;RSQ=0.4319;LDAF=0.2327;ERATE =0.0161;AN=2184;VT=SNP;AA=.;THETA=0.0046;AC=314;SNPSOURCE=LOWCOV;AF=0.14;AS N_AF=0.13;AMR_AF=0.17;AFR_AF=0.04;EUR_AF=0.21 1 10611 rs189107123 C G 100 PASS AN=2184;THETA=0.0077;VT=SNP;AA=.;AC=41;ERA TE=0.0048;SNPSOURCE=LOWCOV;AVGPOST=0.9330;LDAF=0.0479;RSQ=0.3475;AF=0.02;AS N_AF=0.01;AMR_AF=0.03;AFR_AF=0.01;EUR_AF=0.02 # inspect the first two lines after annotation with some reformatting to fit the window size grep -v "^#" gwas-annotated_demo.1kg.vcf | head -2 | tr "\t" "\ " | sed -e "s/.\{75\}/&\n/g" 1 10583 rs58108140 G A 100.0 PASS AVGPOST=0.7707;RSQ=0.4319;LDAF=0.2327;ERA TE=0.0161;AN=2184;VT=SNP;AA=.;THETA=0.0046;AC=314;SNPSOURCE=LOWCOV;AF=0.14; ASN_AF=0.13;AMR_AF=0.17;AFR_AF=0.04;EUR_AF=0.21;EFF=DOWNSTREAM(MODIFIER||37 80|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000423562||1),DOWNSTREA M(MODIFIER||3780|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000438504 ||1),DOWNSTREAM(MODIFIER||3780|||WASH7P|unprocessed_pseudogene|NON_CODING|E NST00000541675||1),DOWNSTREAM(MODIFIER||3821|||WASH7P|unprocessed_pseudogen e|NON_CODING|ENST00000488147||1),DOWNSTREAM(MODIFIER||3828|||WASH7P|unproce ssed_pseudogene|NON_CODING|ENST00000538476||1),INTERGENIC(MODIFIER||||||||| |1),UPSTREAM(MODIFIER||1286|||DDX11L1|processed_transcript|NON_CODING|ENST0 0000456328||1),UPSTREAM(MODIFIER||1289|||DDX11L1|transcribed_unprocessed_ps eudogene|NON_CODING|ENST00000515242||1),UPSTREAM(MODIFIER||1291|||DDX11L1|t ranscribed_unprocessed_pseudogene|NON_CODING|ENST00000518655||1),UPSTREAM(M ODIFIER||1427|||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST0 0000450305||1) 1 10611 rs189107123 C G 100.0 PASS AN=2184;THETA=0.0077;VT=SNP;AA=.;AC=41;E RATE=0.0048;SNPSOURCE=LOWCOV;AVGPOST=0.9330;LDAF=0.0479;RSQ=0.3475;AF=0.02; ASN_AF=0.01;AMR_AF=0.03;AFR_AF=0.01;EUR_AF=0.02;EFF=DOWNSTREAM(MODIFIER||37 52|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000423562||1),DOWNSTREA M(MODIFIER||3752|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000438504 ||1),DOWNSTREAM(MODIFIER||3752|||WASH7P|unprocessed_pseudogene|NON_CODING|E NST00000541675||1),DOWNSTREAM(MODIFIER||3793|||WASH7P|unprocessed_pseudogen e|NON_CODING|ENST00000488147||1),DOWNSTREAM(MODIFIER||3800|||WASH7P|unproce ssed_pseudogene|NON_CODING|ENST00000538476||1),INTERGENIC(MODIFIER||||||||| |1),UPSTREAM(MODIFIER||1258|||DDX11L1|processed_transcript|NON_CODING|ENST0 0000456328||1),UPSTREAM(MODIFIER||1261|||DDX11L1|transcribed_unprocessed_ps eudogene|NON_CODING|ENST00000515242||1),UPSTREAM(MODIFIER||1263|||DDX11L1|t ranscribed_unprocessed_pseudogene|NON_CODING|ENST00000518655||1),UPSTREAM(M ODIFIER||1399|||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST0 0000450305||1)
Filter the 1kg annotated VCF file
In this short example, we filter demo variants known from the GWAS catalog. Many other filtering options are available but by lack of time only one is shown here for both the demo set and for our own data.
# we first download the catalog wget -np http://www.genome.gov/admin/gwascatalog.txt -P $SNPEFF/data # then filter the annotated variants generated above and the GWAS catalog # 'tee' saves a copy of the stdout to file while the data is ALSO sent to screen java -jar $SNPEFF/SnpSift.jar \ gwasCat $SNPEFF/data/gwascatalog.txt \ $SNPEFF/gwas-annotated_demo.1kg.vcf | \ tee $SNPEFF/gwas-annotated_demo.1kg-test.vcf
For our VCF data this gives the following results
infolder="Final_Results/vcf-venn/gold-set" infile="chr21_common_NA18507.vcf.gz" outfolder="snpeff_chr21-results" mkdir -p ${outfolder} # then filter the BWA variants with the GWAS catalog java -jar $SNPEFF/SnpSift.jar \ gwasCat $SNPEFF/data/gwascatalog.txt \ ${infolder}/${infile} >\ ${outfolder}/gwas-${infile%%.gz} # count how many lines correspond to GWAS catalog records grep -c GWASCAT ${outfolder}/gwas-${infile%%.gz} > 82 # isolate top 9 lines with GWAS annotation grep GWASCAT ${outfolder}/gwas-${infile%%.gz} | head
review GWAS hits
##INFO=<ID=GWASCAT,Number=.,Type=String,Description="Trait related to this chromosomal position, according to GWAS catalog"> chr21 11002011 . T C 81.0 . SF=0,1,2;GWASCAT=Bipolar_disorder_and_schizophrenia GT:PL:GQ 0/1:111,0,255:99 chr21 16340289 . C T 225.0 . SF=0,1,2;GWASCAT=Cognitive_performance GT:PL:GQ 0/1:255,0,255:99 chr21 16520832 . G A 222.0 . SF=0,1,2;GWASCAT=Breast_cancer GT:PL:GQ 1/1:255,138,0:99 chr21 16817051 . A G 225.0 . SF=0,1,2;GWASCAT=Ulcerative_colitis GT:PL:GQ 0/1:255,0,229:99 chr21 16914905 . G A 225.0 . SF=0,1,2;GWASCAT=Parkinson_s_disease GT:PL:GQ 0/1:255,0,255:99 chr21 17069166 . C T 225.0 . SF=0,1,2;GWASCAT=Panic_disorder GT:PL:GQ 0/1:255,0,255:99 chr21 17483133 . A T 222.0 . SF=0,1,2;GWASCAT=Obesity_related_traits GT:PL:GQ 1/1:255,120,0:99 chr21 17667720 . T C 225.0 . SF=0,1,2;GWASCAT=Chronic_obstructive_pulmonary_disease_related_biomarkers GT:PL:GQ 0/1:255,0,255:99 chr21 18046232 . A G 225.0 . SF=0,1,2;GWASCAT=Amyotrophic_lateral_sclerosis GT:PL:GQ 0/1:255,0,255:99
Annotate NA18507-chr21 common variants with hg19
The VCF file generated in [Exercise.6] was processed and results are available behind the link below.
whereami="$(pwd)" infolder="../Final_Results/vcf-venn/gold-set" infile="chr21_common_NA18507.vcf.gz" outfolder="snpeff_chr21-results" mkdir -p ${outfolder} # we use here NCBI style chromosome naming as ##reference=hg19 is found in the input file cd ${outfolder} java -Xmx4g -jar $SNPEFF/snpEff.jar \ -c $SNPEFF/snpEff.config \ hg19 \ ${infolder}/${infile} \ > annotated_chr21_common_NA18507.vcf # return to base cd ${whereami}
VCF annotation results
head -100 /home/bits/NGS_DNASeq-training/snpeff_chr21-results/annotated_chr21_common_NA18507.vcf ##fileformat=VCFv4.1 ##samtoolsVersion=0.1.19+ ##reference=file://ref/HiSeq_UCSC_hg19.fa ##contig=<ID=chrM,length=16571> ##contig=<ID=chr1,length=249250621> ##contig=<ID=chr2,length=243199373> ##contig=<ID=chr3,length=198022430> ##contig=<ID=chr4,length=191154276> ##contig=<ID=chr5,length=180915260> ##contig=<ID=chr6,length=171115067> ##contig=<ID=chr7,length=159138663> ##contig=<ID=chr8,length=146364022> ##contig=<ID=chr9,length=141213431> ##contig=<ID=chr10,length=135534747> ##contig=<ID=chr11,length=135006516> ##contig=<ID=chr12,length=133851895> ##contig=<ID=chr13,length=115169878> ##contig=<ID=chr14,length=107349540> ##contig=<ID=chr15,length=102531392> ##contig=<ID=chr16,length=90354753> ##contig=<ID=chr17,length=81195210> ##contig=<ID=chr18,length=78077248> ##contig=<ID=chr19,length=59128983> ##contig=<ID=chr20,length=63025520> ##contig=<ID=chr21,length=48129895> ##contig=<ID=chr22,length=51304566> ##contig=<ID=chrX,length=155270560> ##contig=<ID=chrY,length=59373566> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth"> ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases"> ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads"> ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same"> ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)"> ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> ##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel"> ##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel"> ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed"> ##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies"> ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3"> ##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint"> ##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio"> ##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio"> ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias"> ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL."> ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2."> ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples."> ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2."> ##INFO=<ID=QBD,Number=1,Type=Float,Description="Quality by Depth: QUAL/#reads"> ##INFO=<ID=RPB,Number=1,Type=Float,Description="Read Position Bias"> ##INFO=<ID=MDV,Number=1,Type=Integer,Description="Maximum number of high-quality nonRef reads in samples"> ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. Note: this version may be broken."> ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases"> ##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality non-reference bases"> ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods"> ##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2."> ##fileDate=20120126 ##source=CASAVA-1.8 ##sitesMaxDepth_chr10=121.296617413957 ##indelsMaxDepth_chr10=123.655927166378 ##INFO=<ID=VARTYPE_SNV,Number=0,Type=Flag,Description="âLocus desribes a single nucleotide variantâ"> ##INFO=<ID=VARTYPE_INS,Number=0,Type=Flag,Description="âLocus describes at least one simple insertion alleleâ"> ##INFO=<ID=VARTYPE_DEL,Number=0,Type=Flag,Description="âLocal describes at least one simple deletion alleleâ"> ##INFO=<ID=VARTYPE_COMPLEX,Number=0,Type=Flag,Description="âLocus desribes at least one complex indel allele expressed as an insertion/deletion combination"> ##INFO=<ID=VARTYPE_BREAKEND,Number=0,Type=Flag,Description="âLocus desribes at least one open breakend allele"> ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="âType of structural variantâ"> ##INFO=<ID=CIGAR,Number=A,Type=String,Description="CIGAR alignment for each alternate allele at loci containing indels other than breakends."> ##INFO=<ID=RU,Number=A,Type=String,Description="Smallest repeating sequence unit extended or contracted in the indel allele relative to the reference. RUs are not reported if longer than 20 bases or if both REFREP and IDREP are less than 2."> ##INFO=<ID=REFREP,Number=A,Type=String,Description="Number of times RU is repeated in reference."> ##INFO=<ID=IDREP,Number=A,Type=String,Description="Number of times RU is repeated in indel allele."> ##FILTER=<ID=QGT20,Description="Locus genotype quality is less than 20 or not computable."> ##FILTER=<ID=FILT30,Description="More than 30% of bases at a site are filtered out."> ##FILTER=<ID=SitesMaxDepth,Description="Site occurs at a filtered depth greater than 'sitesMaxDepth'."> ##FILTER=<ID=IndelsMaxDepth,Description="Indel occurs at a depth greater than 'indelsMaxDepth'."> ##FILTER=<ID=REPEAT8,Description="Indel occurs in a homopolymer or dinucleotide track with a reference repeat greater than 8."> ##FILTER=<ID=IndelConflict,Description="Conflicting indel/breakend evidence in region."> ##FILTER=<ID=SiteConflict,Description="Site conflicts with indel/breakend evidence in region."> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DPU,Number=1,Type=Integer,Description="Basecalls used to genotype site after filtration"> ##FORMAT=<ID=DPF,Number=1,Type=Integer,Description="Basecalls filtered from input prior to site genotyping"> ##FORMAT=<ID=AU,Number=4,Type=Integer,Description="Used A,C,G,T basecalls greater than Q20"> ##FORMAT=<ID=DPI,Number=1,Type=Integer,Description="Basecall depth associated with indel, taken from the preceeding site."> ##source_20131105.1=vcf-isec(r899) -f -o -n =3 -w 1 bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz bcftools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz ##sourceFiles_20131105.1=0:bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz,1:bcftools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz,2:public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz ##INFO=<ID=SF,Number=.,Type=String,Description="Source File (index to sourceFiles, f when filtered)"> ##SnpEffVersion="3.6b (build 2014-05-01), by Pablo Cingolani" ##SnpEffCmd="SnpEff hg19 Final_Results/vcf-venn/gold-set/chr21_common_NA18507.vcf.gz " ##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank | Genotype_Number [ | ERRORS | WARNINGS ] )' "> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GAIIx-chr21-BWA.mem chr21 9467416 . C T 86.0 . SF=0,1;EFF=INTERGENIC(MODIFIER||||||||||1) GT:PL:GQ 0/1:116,0,126:99 chr21 9467417 . A C 111.0 . SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1) GT:PL:GQ 0/1:141,0,95:98 chr21 9471670 . A G 37.8 . SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1) GT:PL:GQ 1/1:69,6,0:10 chr21 9472931 . T G 52.0 . SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1) GT:PL:GQ 1/1:84,9,0:16 chr21 9473159 . A G 52.0 . SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1) GT:PL:GQ 1/1:84,9,0:16 chr21 9473186 . T C 89.5 . SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1) GT:PL:GQ 1/1:122,12,0:21 chr21 9474141 . T G 8.64 . SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1) GT:PL:GQ 0/1:38,0,89:40 chr21 9474404 . A C 47.0 . SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1) GT:PL:GQ 1/1:79,9,0:16
Unlike GRCh37 which is very detailed and returns many nice graphics in html form, the hg19 genome database is quite limited and some of the SnpEff html output will be missing.
Use snpEff in a pipe
Combined with VCF filtering or subsetting, the annotation can be done on a specific subset of lines easily.
## not run! # we use here the '-' symbol for <stdin> and the data comes from a piped 'cat' command stream cat <some-data.vcf> | \ Java -Xmx4g -jar $SNPEFF/snpEff.jar \ -c $SNPEFF/snpEff.config \ hg19 \ - \ > annotated_subset_eff.vcf
download exercise files
Download exercise files here
References:
- ↑ http://snpeff.sourceforge.net
- ↑
Pablo Cingolani, Adrian Platts, Le Lily Wang, Melissa Coon, Tung Nguyen, Luan Wang, Susan J Land, Xiangyi Lu, Douglas M Ruden
A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.
Fly (Austin): 2012, 6(2);80-92
[PubMed:22728672] ##WORLDCAT## [DOI] (I p) - ↑ http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps
- ↑ http://www.openbioinformatics.org/annovar/
- ↑ http://snpeff.sourceforge.net/SnpEff_manual.html
- ↑ http://snpeff.sourceforge.net/SnpSift.html
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS_Exercise.7_vcfCodingSnps | NGS_Exercise.7_annovar | NGS Exercise.8 ]