NGS Exercise.7 vcfCodingSnps
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS Exercise.7_SnpEff | NGS Exercise.7_annovar | NGS Exercise.8 ]
# updated 2014 version
Add annotations to variant calling format with vcfCodingSnps.v1.5
Contents
Introduction
vcfCodingSnps.v1.5 was developped to directly annotate VCF data quite similarly to Annovar. A tutorial can be found on the website as well as more info on how to prepare anotation input data. As always, care should be taken to use the same naming system for chromosomes as that used during the mapping ('chr1' or '1' depending on the reference).
command parameters
vcfCodingSnps ################################################################################################## vcfCodingSNP 1.5 -- vcf SNP annotating tool (c) 2010.12 (2012-01) Yanming Li, Goncalo Abecasis Comments and (or) suggestions are welcome! Please send to liyanmin@umich.edu. ################################################################################################## The following parameters are in effect: Availabe Options Input Files : --refgenome [referenceGenomes/genome.V36.fa], --snpfile [], --genefile [geneLists/UCSCknownGene.B36.txt], --n1 [8], --n2 [3], --ns [5], --nc [24] Output Files : --outfile [vcfCodingSNP.out.vcf], --log ## parameters read as follows: -s VCF file This option specifies the name of the input VCF-format SNP file -r refgenome This option specifies the name of the imput reference genome FASTA file. It should be of either NCBI release 36/hg18 or GRCH37/hg19 format. By default it will load NCBI36 reference genome. Users can chose to download other versions of reference genome files at the download page -o outputfile Specifies the name of the input gene file, by default use a gene file (UCSCknownGene.B36.txt) generated by UCSC genome browser -l logfile Specifies the name of the log file. -g genefile Specifies the name of the output VCF-format SNP file, by default will be named vcfCodingSNP.out.vcf -o outputfile Specifies the name of the log file, log file gives more detailed information for each annotated SNP, by default will be named vcfCodingSNP.log --n1 number user defined number of bps into intron for splice site, by default will be set to 8 --n2 number user defined number of bps into extron for splice site, by default will be set to 3 --ns number user defined number of kbps for the range of upstream or downstream of a gene, by default will be set to 5
manual example illustrating vcfCodingSnps results
Annotation output | Interpretation of the output |
---|---|
5'UTR=A26C2[-] | the SNP is in the 5'UTR region of gene A26C2 with a minus strand. |
INTRONIC=POTEG[-] | the SNP is in the intronic region of gene POTEG with a minus strand. |
SYNONYMOUS_CODING=BARD1(uc002veu.2):His506His[-] | the SNP is synonymous coding at the 506th codon in gene BARD1 with a minus strand and it keeps amino-acid His unchanged. |
NON_SYNONYMOUS_CODING=BARD1(uc002veu.2):Arg658Cys[-] | the SNP is non_synonymous coding at the 658th codon in gene BARD1 (ucsc gene name uc002veu.2)with a minus strand and it changes amino-acid from Arg to Cys. |
SPLICE_SITE=FARP2(uc002wbi.1)[+] | the SNP is in the SPLICE_SITE (5 bp within exon start or end positions in the coding region) of gene FARP2 (ucsc gene name uc002wbi.1) with a plus strand. |
STOP_GAINED=C2orf83(uc002vph.1):Trp141stop[-] | the SNP is the 141th codon in gene MAPK12 (ucsc gene name uc002vph.1) with a minus strand and it changes amino-acid Trp to a stop codon. |
STOP_LOST=OR2M3(uc001ieb.1):stop313Arg[+] | the SNP is the 313th codon in gene OR2M3 (ucsc gene name uc001ieb.1) with a plus strand and it changes a stop codon to amino-acid Arg. |
Example results and meaning
Pos | Alt SNP | Ref SNP | Alt SNP Codon | Ref SNP Codon | Alt SNP AA | Ref SNP AA | Anno Type |
---|---|---|---|---|---|---|---|
3 | G | A | . | . | . | . | 5'UTR |
5 | A | C | CAT | CCT | His | Pro | Non_Synonymous |
13 | G | T | CCG | CCT | Pro | Pro | Synonymous |
25 | C | A | TAC | TAA | Tyr | Stop | Stop Loss |
43 | A | C | . | . | . | . | Splice Site |
44 | G | T | . | . | . | . | Intronic |
50 | C | A | . | . | . | . | Essential Splice Site |
53 | A | C | ACC | CCC | Thr | Pro | Non_Synonymous |
66 | C | T | . | . | . | . | 3'UTR |
72 | A | C | . | . | . | . | Downstram |
full list of annotation types obtained with vcfCodingSnps
Category | Definition used in vcfCoding Snps |
---|---|
stop gained | a SNP in coding sequence and introducing a TAG, TAA, or TGA stop codon |
stop lost | a SNP in coding sequence and causing a loss of a TAG, TAA, or TGA stop codon |
non- synonymous coding | a SNP in coding sequence, located in a codon resulting in a change of amino acid, excluding SNPs that can be defined as either stop gained or stop lost |
synonymous coding | a SNP in coding sequence, located in a codon that not resulting in a change of amino acid |
essential splice site | a SNP changing the highly conserved GU in the first two basepairs of the intron or (AG) in the last two basepair of the intron |
splice site | a SNP occurring in 3 - N1 basepairs into the intron, or N2 basepairs into the exon . N1 by default is 8, N2 by default is 3. N1 and N2 can be defined by user through option --n1 --n2. |
5' UTR | a SNP located within the 5' UTR of a transcript |
3' UTR | a SNP located within the 3' UTR of a transcript |
intronic | a SNP in the intron of a known gene, and cannot be defined as essential splice site or splice site |
upstream | A SNP located within N kb from the transcript start site (5'-end) of a known gene, N by default is 5 and can be defined by user through option --ns |
downstream | A SNP located within N kb from the transcript end site (3'-end) of a known gene, N by default is 5 and can be defined by user through option --ns |
introgenic | A SNP not located within a known gene and also not identified as upstream or downstream of a knowngene |
Annotate the VCF file obtained by intersecting 'aln' and 'mem' calls
generate a genepred table for refseq transcripts
vcfCodingSnps needs a specific format (genepred) for the gene models. This format is close to what the UCSC table provides and can be obtained as described next.
outfolder="vcfCodingSnps-results" mkdir -p ${outfolder} # Go to http://genome.ucsc.edu/ # >>> Click "table" # >>> Specify the fields required (clade: mammal, genome:human etc.) # >>> In "track" (popup), choose ''RefSeq Genes'' # >>> In tables select "refGene" # >>> for the ''output format'', select fields shown in the hidden content below # >>> get output gene file (hg19_refGene.genepred.txt) and move it from ''Downloads'' to ${outfolder}
UCSC table screen-shots
- UCSC table selection for the refGene track
- Selection of fields to export to get a genepred output
Hit 'get output' do start downloading
You should get output like shown next with the top two records
name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds name2
NM_032291 chr1 + 66999824 67210768 67000041 67208778 25 66999824,67091529,67098752,67101626,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755, 67000051,67091593,67098777,67101698,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67210768, SGIP1
NM_001080397 chr1 + 8378144 8404227 8378168 8404073 9 8378144,8384365,8385357,8385877,8390268,8395496,8397875,8399552,8403806, 8378246,8384786,8385450,8386102,8390996,8395650,8398052,8399758,8404227, SLC45A1
NM_032785 chr1 - 48998526 50489626 48999844 50489468 14 48998526,49000561,49005313,49052675,49056504,49100164,49119008,49128823,49332862,49511255,49711441,50162984,50317067,50489434, 48999965,49000588,49005410,49052838,49056657,49100276,49119123,49128913,49332902,49511472,49711536,50163109,50317190,50489626, AGBL4
Annotate the VCF data
infolder="Final_Results/vcf-venn/gold-set" outfolder="vcfCodingSnps-results" mkdir -p ${ourfolder} infile="chr21_common_NA18507.vcf" outfile="annotated_chr21_common_NA18507.vcf" # annotation files refgenome=ref/HiSeq_UCSC_hg19.fa genepred=$BIOTOOLS/vcfCodingSnps/geneLists/refGene.B37.txt # default parameters are: # --n1 [8], --n2 [3], --ns [5], --nc [24] ## option 1) # we use here the long names for parameters, see above for the short names # vcfCodingSnps does not read from bgzipped files # decompress (-d) the VCF data but keep the original archive (-c) bgzip -cd ${infolder}/${infile}.gz > ${outfolder}/${infile} vcfCodingSnps --refgenome ${refgenome} \ --snpfile ${outfolder}/${infile} \ --genefile ${genepred} \ --outfile ${outfolder}/${outfile} \ --log \ 2 > "${outfolder}/vcfCodingSnps_${outfile}.log" cd ## option 2) # we can also decompress on the fly with a redirection and short names vcfCodingSnps -r ${refgenome} \ -s <(bgzip -cd ${infolder}/${infile}.gz) \ -g ${genepred} \ -o ${outfolder}/${outfile} \ -l "${outfolder}/vcfCodingSnps_${outfile}.log"
bgzip command datails
Usage: bgzip [options] [file] ... Options: -c write on standard output, keep original files unchanged -d decompress -f overwrite files without asking -b INT decompress at virtual file pointer INT -s INT decompress INT bytes in the uncompressed file -h give this help
terminal output
################################################################################################## vcfCodingSNP 1.5 -- vcf SNP annotating tool (c) 2010.12 Yanming Li, Goncalo Abecasis Commend and (or) suggestions are welcome! Please send to liyanmin@umich.edu. ################################################################################################## The following parameters are in effect: Availabe Options Input Files : --refgenome [ref/HiSeq_UCSC_hg19.fa], --snpfile [/dev/fd/63], --genefile [/opt/biotools/vcfCodingSnps/geneLists/refGene.B37.txt], --n1 [8], --n2 [3], --ns [5] Output Files : --outfile [vcfCodingSnps-results/annotated_chr21_common_NA18507.vcf], --log [ON] Reading chromosome >chrM... Reading chromosome >chr1... Reading chromosome >chr2... Reading chromosome >chr3... Reading chromosome >chr4... Reading chromosome >chr5... Reading chromosome >chr6... Reading chromosome >chr7... Reading chromosome >chr8... Reading chromosome >chr9... Reading chromosome >chr10... Reading chromosome >chr11... Reading chromosome >chr12... Reading chromosome >chr13... Reading chromosome >chr14... Reading chromosome >chr15... Reading chromosome >chr16... Reading chromosome >chr17... Reading chromosome >chr18... Reading chromosome >chr19... Reading chromosome >chr20... Reading chromosome >chr21... Reading chromosome >chr22... Reading chromosome >chrX... Reading chromosome >chrY... start mapping mapping snp file... ...DONE! snp mapsize = 69969 mapping gene file... ...DONE! gene mapsize = 37973 start annotating... ... Error: The total number of bases between coding start and coding end are not right! genename=NM_182484 and base # btw cdstart and cdend=119 Error: The total number of bases between coding start and coding end are not right! genename=NM_001187 and base # btw cdstart and cdend=119 Error: The total number of bases between coding start and coding end are not right! genename=NM_181606 and base # btw cdstart and cdend=265 Error: The total number of bases between coding start and coding end are not right! genename=NM_181606 and base # btw cdstart and cdend=265 Error: The total number of bases between coding start and coding end are not right! genename=NM_181606 and base # btw cdstart and cdend=265 Error: The total number of bases between coding start and coding end are not right! genename=NM_181606 and base # btw cdstart and cdend=265 Error: The total number of bases between coding start and coding end are not right! genename=NM_181606 and base # btw cdstart and cdend=265 DONE! Complete annotating!!!
Inspect the results and isolate interesting calls
Looking for 'stop gain' calls (STOP_GAINED) in the VCF file is easy with a grep command
grep -w "STOP_GAINED" ${outfolder}/${outfile} | column -t chr21 30339120 . C A 224 . SF=0,1,2;STOP_GAINED=LTN1(NM_015565):Arg611stop[-] \ GT:PL:GQ 0/1:254,0,255:99 chr21 31913981 . AG A 214 . SF=0,1,2;STOP_GAINED=KRTAP19-6(NM_181612):Glu58stop[-] \ GT:PL:GQ 1/1:255,135,0:99 chr21 40584598 . C T 225 . SF=0,1,2;STOP_GAINED=BRWD1(NM_018963):Tyr1298stop[-];STOP_GAINED=BRWD1(NM_033656):Tyr1298stop[-] \ GT:PL:GQ 0/1:255,0,199:99 chr21 45656920 . C T 223 . SF=0,1,2;STOP_GAINED=ICOSLG(NM_015259):Leu79stop[-] \ GT:PL:GQ 0/1:253,0,255:99 chr21 45993819 . C T 27 . SF=0,1,2;INTRONIC=C21orf29(NM_144991)[-];STOP_GAINED=KRTAP10-4(NM_198687):Gln62stop[+] \ GT:PL:GQ 0/1:57,0,70:60
remember that the same variant may have different effects depending on the splice isoform of a given gene.
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS Exercise.7_SnpEff | NGS Exercise.7_annovar | NGS Exercise.8 ]