NGS Exercise.7 annovar
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS Exercise.7_SnpEff | NGS_Exercise.7_vcfCodingSnps | NGS Exercise.8 ]
# updated 2014 version
Add annotations to variant calling format
Contents
- 1 Annovar overview
- 2 Annovar step by step
- 2.1 Convert vcf to annovar format
- 2.2 Genomic annotation of the converted data using Annovar
- 2.3 Intergenic variants annotated by Annovar
- 2.4 Exomic variants annotated by Annovar
- 2.5 Identify candidate deleterious variants
- 2.6 Filter dbSNP variants based on dbSNP.138
- 2.7 BEDtools to create a subset of the original VCF from a subset of annotated records
- 2.8 VCFtools to create a subset of the original VCF from a subset of annotated records
- 3 download exercise files
Annovar overview
Variant lists are important but often long and not easy to evaluate. In order to rank candidate variant for validation, we need to know where these variants occur and what effect they may have on the regulation of genes when close or included into a gene region or on the protein product when falling into exons.
Whatever software you apply to clean your variant calls, you will still need to validate what you get from NGS :o)
Several tools exist to annotate variant lists and predict variant effects, we present here two popular tools
- Annovar (http://www.openbioinformatics.org/annovar/[1], publication[2])
- the recent EnsEMBL Variant Effect Predictor tool (VEP:http://www.ensembl.org/info/docs/variation/vep/index.html[3])
- and the more recent and UCSC graphical Variant Annotation Integrator (VAI:http://genome.ucsc.edu/cgi-bin/hgVai[4]).
- vcfCodingSnps <http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps>[5]
- SnpEff and SnpSift <http://snpeff.sourceforge.net>[6]
Other valuable variant annotation tools exist, including:
- variant tools <http://varianttools.sourceforge.net/Annotation/HomePage>[7]
- VAT <http://vat.gersteinlab.org/download.php>[8]
- VariantAnnotation (a Bioconductor package) <http://www.bioconductor.org/packages/2.12/bioc/html/VariantAnnotation.html>[9]
And many more that you can find with Dr G. search now
Annovar was written in Perl and is very straightforward to use when one has bascic knowledge of CLI and file parsing. It is not a fully automated process and requires decisions at run time. Annovar requires installing a large number of reference and annotation files on your computer and therefore was not very adapted to today training. Only few relevant files were installed. Annovar supports most if not all UCSC table databases with minor or no modifications, it is also compatible with several key data formats including BED and VCF.
A considerable amount of resources can be attached to Annovar and used for annotation and filtering. Please read the Annovar database page [10] to get a clear idea.
Publication: Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data Nucleic Acids Research, 38:e164, 2010.
We recommend Annovar for users that need confidentiality and/or will regularly annotate variants since it allows building workflow scripts to do so in a straightforward way.
annotate_variation.pl command parameters
Usage: annotate_variation.pl [arguments] <query-file|table-name> <database-location> Optional arguments: -h, --help print help message -m, --man print complete documentation -v, --verbose use verbose output Arguments to download databases or perform annotations --downdb download annotation database --geneanno annotate variants by gene-based annotation (infer functional consequence on genes) --regionanno annotate variants by region-based annotation (find overlapped regions in database) --filter annotate variants by filter-based annotation (find identical variants in database) Arguments to control input and output --outfile <file> output file prefix --webfrom <string> specify the source of database (ucsc or annovar or URL) (downdb operation) --dbtype <string> specify database type --buildver <string> specify genome build version (default: hg18 for human) --time print out local time during program run --comment print out comment line (those starting with #) in output files --exonsort sort the exon number in output line (gene-based annotation) --transcript_function use transcript name rather than gene name (gene-based annotation) --hgvs use HGVS format for exonic annotation (c.122C>T rather than c.C122T) (gene-based annotation) --separate separately print out all functions of a variant in several lines (gene-based annotation) --seq_padding create a new file with cDNA sequence padded by this much either side (gene-based annotation) --(no)firstcodondel treat first codon deletion as wholegene deletion (default: ON) (gene-based annotation) --aamatrix <file> specify an amino acid substitution matrix file (gene-based annotation) --colsWanted <string> specify which columns to output by comma-delimited numbers (region-based annotation) --scorecolumn <int> the column with scores in DB file (region-based annotation) --gff3dbfile <file> specify a DB file in GFF3 format (region-based annotation) --bedfile <file> specify a DB file in BED format file (region-based annotation) --gff3attribute output all fields in GFF3 attribute (default: ID and score only) --genericdbfile <file> specify a DB file in generic format (filter-based annotation) --vcfdbfile <file> specify a DB file in VCF format (filter-based annotation) --otherinfo print out additional columns in database file (filter-based annotation) --infoasscore use INFO field in VCF file as score in output (filter-based annotation) Arguments to fine-tune the annotation procedure --batchsize <int> batch size for processing variants per batch (default: 5m) --genomebinsize <int> bin size to speed up search (default: 100k for -geneanno, 10k for -regionanno) --expandbin <int> check nearby bin to find neighboring genes (default: 2m/genomebinsize) --neargene <int> distance threshold to define upstream/downstream of a gene --exonicsplicing report exonic variants near exon/intron boundary as 'exonic;splicing' variants --score_threshold <float> minimum score of DB regions to use in annotation --reverse reverse directionality to compare to score_threshold --normscore_threshold <float> minimum normalized score of DB regions to use in annotation --rawscore output includes the raw score (not normalized score) in UCSC Browser Track --minqueryfrac <float> minimum percentage of query overlap to define match to DB (default: 0) --splicing_threshold <int> distance between splicing variants and exon/intron boundary (default: 2) --indel_splicing_threshold <int> if set, use this value for allowed indel size for splicing variants (default: --splicing_threshold) --maf_threshold <float> filter 1000G variants with MAF above this threshold (default: 0) --sift_threshold <float> SIFT threshold for deleterious prediction for -dbtype avsift (default: 0.05) --precedence <string> comma-delimited to specify precedence of variant function (default: exonic>intronic...) --indexfilter_threshold <float> controls whether filter-based annotation use index if this fraction of bins need to be scanned (default: 0.9) Arguments to control memory usage --memfree <int> ensure minimum amount of free system memory (default: 0) --memtotal <int> limit total amount of memory used by ANNOVAR (default: 0, unlimited, in the order of kb) --chromosome <string> examine these specific chromosomes in database file Function: annotate a list of genetic variants against genome annotation databases stored at local disk. Example: #download gene annotation database (for hg18 build) and save to humandb/ directory annotate_variation.pl -downdb refGene humandb/ annotate_variation.pl -buildver mm9 -downdb mousedb/ annotate_variation.pl -downdb -webfrom annovar esp6500si_all humandb/ #gene-based annotation of variants in the varlist file (by default --geneanno is ON) annotate_variation.pl ex1.human humandb/ #region-based annotate variants annotate_variation.pl -regionanno -dbtype cytoBand ex1.human humandb/ annotate_variation.pl -regionanno -dbtype gff3 -gff3dbfile tfbs.gff3 ex1.human humandb/ #filter rare or unreported variants (in 1000G/dbSNP) or predicted deleterious variants annotate_variation.pl -filter -dbtype 1000g2012apr_all -maf 0.01 ex1.human humandb/ annotate_variation.pl -filter -dbtype snp130 ex1.human humandb/ annotate_variation.pl -filter -dbtype avsift ex1.human humandb/ Version: $LastChangedDate: 2013-08-23 11:32:41 -0700 (Fri, 23 Aug 2013) $
Annovar step by step
Convert vcf to annovar format
Annovar requires a strict format for input, the format is well documented in the web documentation and a command is ready to process our VCF data as follows:
# we use here the data obtained from the tripple vcf intersect described above infolder=Final_Results/vcf-venn/gold-set infile=chr21_common_NA18507.vcf.gz outfolder=annovar-results mkdir -p ${outfolder} # we now decompress the data 'on the fly' and feed it to annovar convert2annovar.pl -format vcf4 \ <(zcat ${infolder}/${infile}) > \ ${outfolder}/chr21_common_NA18507.annovar
NOTICE: Finished reading 69940 lines from VCF file NOTICE: A total of 69851 locus in VCF file passed QC threshold, representing 62895 SNPs \ (42796 transitions and 20099 transversions) and 7074 indels/substitutions NOTICE: Finished writting 62886 SNPs (42793 transitions and 20093 transversions) and 6965 indels/substitutions for 1 sample
The obtained file is ready for annotation. We will use the hg19 RefSeq gene table to add information in the intergenic and exonic parts hit by our chr21 variants.
# inspect the first 5 calls in the VCF file infolder=Final_Results/vcf-venn/gold-set infile=chr21_common_NA18507.vcf.gz zgrep -v "^##" ${infolder}/${infile} | head -6 | column -t
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GAIIx-chr21-BWA.mem chr21 9467416 . C T 86 . SF=0,1 GT:PL:GQ 0/1:116,0,126:99 chr21 9467417 . A C 111 . SF=0,1,2 GT:PL:GQ 0/1:141,0,95:98 chr21 9471670 . A G 37.8 . SF=0,1,2 GT:PL:GQ 1/1:69,6,0:10 chr21 9472931 . T G 52 . SF=0,1,2 GT:PL:GQ 1/1:84,9,0:16 chr21 9473159 . A G 52 . SF=0,1,2 GT:PL:GQ 1/1:84,9,0:16
# inspect the top-5 lines in the annovar result ( | column -t to make it pretty) infolder=annovar-results head -5 ${infolder}/chr21_common_NA18507.annovar | column -t
chr21 9467416 9467416 C T het 86 . chr21 9467417 9467417 A C het 111 . chr21 9471670 9471670 A G hom 37.8 . chr21 9472931 9472931 T G hom 52 . chr21 9473159 9473159 A G hom 52 .
As seen by comparing both extracts, the before last 2 columns report respectively, the ploidy of the call, and the quality score.
The next step will use Annovar to find variants faling in RefSeq exons or close to a RefSeq gene.
# obtain up-to-date annotation from the annovar file server for hg19:RefSeq annovardb=$BIOTOOLS"/annovar/humandb/" #not run# annotate_variation.pl -buildver hg19 -downdb refGene ${annovardb}
When run, the command reports downloading several files
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz ... OK NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refLink.txt.gz ... OK NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_refGeneMrna.fa.gz ... OK NOTICE: Uncompressing downloaded files NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the '/opt/biotools/annovar/humandb' directory
Genomic annotation of the converted data using Annovar
# define path infolder=Final_Results/annovar-results outfolder=annovar-results mkdir -p ${outfolder} annovardb=$BIOTOOLS"/annovar/humandb/" # start annotating annotate_variation.pl \ --buildver hg19 \ --geneanno \ --outfile ${outfolder}/common_NA18507 \ ${infolder}/chr21_common_NA18507.annovar \ ${annovardb}
The comment reports downloading several files and complains about discrepancies between our ref alleles and bases in annovar hg19 fasta file. This is often seen and reflects discrepancies between genbank mRNA records used for annotation and the reference genome sequence (no worries)
NOTICE: Reading gene annotation from /opt/biotools/annovar/humandb/hg19_refGene.txt ... Done with 47806 transcripts \ (including 9829 without coding sequence annotation) for 25224 unique genes NOTICE: Reading FASTA sequences from /opt/biotools/annovar/humandb/hg19_refGeneMrna.fa ... Done with 277 sequences WARNING: A total of 3 sequences cannot be found in /opt/biotools/annovar/humandb/hg19_refGeneMrna.fa (example: NM_001291412 NM_001291411 NM_001290224) WARNING: A total of 301 sequences will be ignored due to lack of correct ORF annotation --------------------------------------------------------------------------------------- WARNING: 7 exonic SNPs have WRONG reference alleles specified in your input file! WARNING: An example input line is <chr21 35467645 35467645 A G hom 222 .> WARNING: ANNOVAR can still annotate exonic_variant_function for the mutation correctly! WARNING: you may have used wrong -buildver, or specified incorrect reference allele, or used outdated mRNA FASTA file! --------------------------------------------------------------------------------------- NOTICE: Finished gene-based annotation on 69851 genetic variants in annovar-results/chr21_common_NA18507.annovar NOTICE: Output files were written to annovar-results/common_NA18507.variant_function, \ annovar-results/common_NA18507.exonic_variant_function
Intergenic variants annotated by Annovar
infolder=annovar-results head -5 ${infolder}/common_NA18507.variant_function | column -t
command results
intergenic NONE(dist=NONE),MIR3648(dist=358416) chr21 9467416 9467416 C T het 86 . intergenic NONE(dist=NONE),MIR3648(dist=358415) chr21 9467417 9467417 A C het 111 . intergenic NONE(dist=NONE),MIR3648(dist=354162) chr21 9471670 9471670 A G hom 37.8 . intergenic NONE(dist=NONE),MIR3648(dist=352901) chr21 9472931 9472931 T G hom 52 . intergenic NONE(dist=NONE),MIR3648(dist=352673) chr21 9473159 9473159 A G hom 52 .
Note that all the information present in the vcf-converted input is reported in annovar annotation files. This information can now be used to filter and decide of which variants to believe and validate.
You can easily create a bed formatted version of this output with gawk that can be viewed in IGV. The score column uses the quality score Q and the depth is stored in the name (empty in this case)
Process intergenic calls
When columns need to be reordered, AWK is here to help, the following command presents the first line of input with numbers to help us build a correct awk script.
infolder=Final_Results/annovar-results testfile=${infolder}/common_NA18507.variant_function head -1 ${testfile} | transpose -t | awk '{print NR,$0}' 1 intergenic 2 NONE(dist=NONE),MIR3648(dist=358416) 3 chr21 4 9467416 5 9467416 6 C 7 T 8 het 9 86 10 .
infolder=Final_Results/annovar-results outfolder=annovar-results mkdir -p ${outfolder} # switch back to 0-based open coordinates # replace spaces by '_' in name field gawk 'BEGIN{FS="\t"; OFS="\t"} { namef=$6"/"$7"|"$8"|"$2"|"$3"|DP="$10 gsub(/ /, "_", namef) print $3, $4-1, $5, namef ,"+", $9 }' ${infolder}/common_NA18507.variant_function > \ ${outfolder}/common_NA18507.variant_function.bed
infolder=Final_Results/annovar-results head -5 ${infolder}/common_NA18507.variant_function.bed
intergenic BED results
==> common_NA18507.variant_function.bed <== chr21 9467415 9467416 C/T|het|NONE(dist=NONE),MIR3648(dist=358416)|chr21|DP=. + 86 chr21 9467416 9467417 A/C|het|NONE(dist=NONE),MIR3648(dist=358415)|chr21|DP=. + 111 chr21 9471669 9471670 A/G|hom|NONE(dist=NONE),MIR3648(dist=354162)|chr21|DP=. + 37.8 chr21 9472930 9472931 T/G|hom|NONE(dist=NONE),MIR3648(dist=352901)|chr21|DP=. + 52 chr21 9473158 9473159 A/G|hom|NONE(dist=NONE),MIR3648(dist=352673)|chr21|DP=. + 52
Exomic variants annotated by Annovar
The obtained files report variants with additional fields where the coding was added for CDS variants and gene symbols for all exomic calls.
As seen below, the added columns inform about the type of variation, the change as compared to the RefSeq mRNA sequence, and finally, when applies, the change at protein level (incl frameshifts and added/removed start and stop CDS). A number of possible annotations are listed on the annovar web site. (please read the full documentation).
infolder=annovar-results head -5 ${infolder}/common_NA18507.exonic_variant_function
bits@bits-Aspire-V3-571G ~/NGS_DNASeq-training $ head -5 ${infolder}/common_NA18507.exonic_variant_function line1351 stopgain SNV TPTE:NM_199259:exon23:c.C1518A:p.C506X,TPTE:NM_199260:exon22:c.C1458A:p.C486X,TPTE:\ NM_199261:exon24:c.C1572A:p.C524X, chr21 10906989 10906989 C T het 22. line1369 stoploss SNV TPTE:NM_199259:exon21:c.G1391C:p.X464S,TPTE:NM_199260:exon20:c.G1331C:p.X444S,TPTE:\ NM_199261:exon22:c.G1445C:p.X482S, chr21 10910311 10910311 T G hom 22. line1370 synonymous SNV TPTE:NM_199259:exon21:c.T1362C:p.V454V,TPTE:NM_199260:exon20:c.T1302C:p.V434V,TPTE:\ NM_199261:exon22:c.T1416C:p.V472V, chr21 10910340 10910340 C G het 13. line1371 nonsynonymous SNV TPTE:NM_199259:exon21:c.G1355C:p.S452T,TPTE:NM_199260:exon20:c.G1295C:p.S432T,TPTE:\ NM_199261:exon22:c.G1409C:p.S470T, chr21 10910347 10910347 A G ho222 . line1413 nonsynonymous SNV TPTE:NM_199259:exon18:c.T1102G:p.C368G,TPTE:NM_199260:exon17:c.T1042G:p.C348G,TPTE:\ NM_199261:exon19:c.T1156G:p.C386G, chr21 10920098 10920098 T C ho222 .
Process exomic calls (! fields are different with one additional column in the Annovar file)
infolder=Final_Results/annovar-results outfolder=annovar-results mkdir -p ${outfolder} # switch back to 0-based open coordinates # replace spaces by '_' in name field gawk 'BEGIN{FS="\t"; OFS="\t"} { namef=$7"/"$8"|"$9"|"$2"|"$3"|DP="$11 gsub(/ /, "_", namef) print $4, $5-1, $6, namef, "+", $10 }' ${infolder}/common_NA18507.exonic_variant_function > \ ${outfolder}/common_NA18507.exonic_variant_function.bed
infolder=Final_Results/annovar-results head -5 ${infolder}/common_NA18507.exonic_variant_function.bed
exomic BED results
==> common_NA18507.exonic_variant_function.bed <== chr21 10906988 10906989 C/T|het|stopgain_SNV|TPTE:NM_199259:exon23:c.C1518A:p.C506X,\ TPTE:NM_199260:exon22:c.C1458A:p.C486X,TPTE:NM_199261:exon24:c.C1572A:p.C524X,|DP=. \ + 225 chr21 10910310 10910311 T/G|hom|stoploss_SNV|TPTE:NM_199259:exon21:c.G1391C:p.X464S,\ TPTE:NM_199260:exon20:c.G1331C:p.X444S,\ TPTE:NM_199261:exon22:c.G1445C:p.X482S,|DP=. \ + 222 chr21 10910339 10910340 C/G|het|synonymous_SNV|TPTE:NM_199259:exon21:c.T1362C:p.V454V,\ TPTE:NM_199260:exon20:c.T1302C:p.V434V,TPTE:NM_199261:exon22:c.T1416C:p.V472V,|DP=. \ + 139 chr21 10910346 10910347 A/G|hom|nonsynonymous_SNV|TPTE:NM_199259:exon21:c.G1355C:p.S452T,\ TPTE:NM_199260:exon20:c.G1295C:p.S432T,TPTE:NM_199261:exon22:c.G1409C:p.S470T,|DP=. \ + 222 chr21 10920097 10920098 T/C|hom|nonsynonymous_SNV|TPTE:NM_199259:exon18:c.T1102G:p.C368G,\ TPTE:NM_199260:exon17:c.T1042G:p.C348G,TPTE:NM_199261:exon19:c.T1156G:p.C386G,|DP=. \ + 222
This BED file can be loaded in IGV very easily to create a track used to guide you through the exploration of each variant
Identify candidate deleterious variants
As we work today with genome information from a healthy donnor (1000g YRI person), we do not really expect finding disease associated variants.
We will however look for potentially detrimental exomic mutations based on gene annotations produced by annovar. One way to do this is to first identify interesting annotation categories in the annovar results then find where they occur in the genome. A simple example is provided here to find all possible exomic effects stored in column #2 of the annovar raw data.
infolder=Final_Results/annovar-results cut -f 2 ${infolder}/common_NA18507.exonic_variant_function | \ sort | uniq -c | sort -n
Note the second sort -n to rank the lines by counts; as you see here again, piping different commands in a meaningful sequence is very powerful and has no limits
1 frameshift insertion 1 nonframeshift insertion 2 frameshift deletion 2 stopgain SNV 2 stoploss SNV 3 nonframeshift deletion 172 synonymous SNV 187 nonsynonymous SNV
Searching for the unique 'stop-loss' call is easy with a grep command on the BED file (could also be done on the raw data).
infolder=annovar-results grep "stoploss" ${infolder}/common_NA18507.exonic_variant_function.bed
chr21 42821113 42821113 T/C|het|stoploss SNV|MX1:NM_001178046:exon12:c.A1323C:p.X441Y,|DP=. + 225 chr21 45994841 45994841 A/C|het|stoploss SNV|KRTAP10-4:NM_198687:exon1:c.A1206C:p.X402C,|DP=. + 124
Slight differences between RefSeq files used by Annovar and by other tools (SnpEff) will lead to slight differences in results. Some splice variants might be erroneously annotated in one file or simply absent leading to different VCF annotations added. You should always control selected results by eye using IGV to ensure that the reads really support the annotation
Searching for 'synonymous calls' is easy with a grep -w command on the annovar raw data or BED data.
infolder=Final_Results/annovar-results # on raw data, the space before SNV can be used grep -w "synonymous" ${infolder}/common_NA18507.exonic_variant_function.bed | head -5 # on BED data, _SNV shoul dbeadded grep -w "synonymous_SNV" ${infolder}/common_NA18507.exonic_variant_function.bed | head -5
chr21 10910339 10910340 C/G|het|synonymous_SNV|TPTE:NM_199259:exon21:c.T1362C:p.V454V,\ TPTE:NM_199260:exon20:c.T1302C:p.V434V,\ TPTE:NM_199261:exon22:c.T1416C:p.V472V,|DP=. \ + 139 chr21 10970016 10970017 C/T|het|synonymous_SNV|TPTE:NM_199259:exon6:c.T111A:p.T37T,\ TPTE:NM_199260:exon6:c.A111A:p.E37E,\ TPTE:NM_199261:exon6:c.A111A:p.X37X,|DP=. \ + 225 chr21 15918576 15918577 G/C|hom|synonymous_SNV|SAMSN1:NM_022136:exon1:c.C6G:p.L2L,|DP=. \ + 222 chr21 16340288 16340289 C/T|het|synonymous_SNV|NRIP1:NM_003489:exon4:c.G225A:p.G75G,|DP=. \ + 225 chr21 17203784 17203785 A/T|het|synonymous_SNV|USP25:NM_001283041:exon16:c.A1830T:p.A610A,\ USP25:NM_013396:exon16:c.A1830T:p.A610A,\ USP25:NM_001283042:exon16:c.A1830T:p.A610A,|DP=. \ + 153
When using 'grep', adding the -w option requires that the searched word be surrounded by spaces or line ends, this avoids finding a string within a longer word (eg 'synonymous' within 'nonsynonymous')
IGV is used to confirm the call from the aligned reads (chr21:45994841-45994841, at the end of the TSPEAR gene)
IGV at the position of the call
About half of the reads support the het call.
NOTE: Annovar is not limited to annotating your data, it can also filter it to keep or trash variants based on genomic context or any external annotation database you can provide. Please refer to the Annovar online documentation for more info. As incentive, the next block of text was taken from the site and shows particular Annovar filters (non exhaustive).
ANNOVAR Function II: Region-based annotation - Most conserved element annotation - Transcription factor binding site annotation - Identify cytogenetic band for genetic variants - Identify variants located in segmental duplications - Identify previously reported structural variants in DGV (Database of Genomic Variants) - Identify variants reported in previously published GWAS (Genome-wide association studies - Identify variants in ENCODE annotated regions (transcribed regions, H3K4Me1 regions, H3K4Me3 regions, H3K27Ac regions, DNaseI hypersensitivity regions, transcription factor ChIP-Seq regions, etc) - Identify dbSNP variants in user-specified regions - Identify variants in other genomic regions annotated with other functions - Annotating custom-made databases conforming to GFF3 (Generic Feature Format version 3) - Identifying variants specified in regions in BED file
Filter dbSNP variants based on dbSNP.138
As example, the Annovar filter command is applied here with the dbSNP137 database obtained from the UCSC table repository.
It is known that dbSNP versions > .129 contain disease variants and are not really suited for filtering
Recently, the UCSC team produced different subsets of dbSNP files including a set limited to known disease variants (flagged) and a set excluding known disease variants (common) that are better suited for filtering. Please refer to the UCSC table page dedicated to variants (see screenshot in the collapsed area below and [11]) for more info (your can download .txt.gz files from your browserfrom the UCSC FTP page[12])
dbSNP subsets
The selection of SNPs with a minor allele frequency of 1% or greater is an attempt to identify variants that appear to be reasonably common in the general population. Taken as a set, common variants should be less likely to be associated with severe genetic diseases due to the effects of natural selection, following the view that deleterious variants are not likely to become common in the population. However, the significance of any particular variant should be interpreted only by a trained medical geneticist using all available information.
The dbSNP.138 data is present in the following tracks: * All SNPs(138) - all SNPs from dbSNP mapping to reference assembly. * Common SNPs(138) - SNPs with >= 1% minor allele frequency (MAF), mapping only once to reference assembly. * Flagged SNPs(138) - SNPs < 1% minor allele frequency (MAF) (or unknown), mapping only once to reference assembly, flagged in dbSnp as "clinically associated" -- not necessarily a risk allele! * Mult. SNPs(138) - SNPs mapping in more than one place on reference assembly.
The table page looks as follows and allows to download data in your browser
infolder=Final_Results/annovar-results infile=chr21_common_NA18507.annovar # annovar DB folder annovardb=$BIOTOOLS"/annovar/humandb/" # choose version here dbsnpver=138 # download 'all' and 'common' dbSNP.${dbsnpver} data from the UCSC # 1] download full database (~2GB gz file!! this can take some time!) # the following lines were commented out to prevent you from re-running them ;-) ## annotate_variation.pl \ ## --buildver hg19 \ ## --downdb snp${dbsnpver} \ ## ${annovardb} # filter all dbSNP.${dbsnpver} variants # NOT RUN either, read comment below ## annotate_variation.pl \ ## --buildver hg19 \ ## --filter \ ## --dbtype snp${dbsnpver} \ ## ${infolder}/${infile} \ ## ${annovardb}
It is SAFER to consider only common SNVs with >1% minor allele variants (unlikely disease-associated)
# 2] download common records only # the following lines were commented out to prevent you from re-running them ;-) ## annotate_variation.pl \ ## --buildver hg19 \ ## --downdb snp${dbsnpver}Common \ ## ${annovardb} infolder=Final_Results/annovar-results infile=chr21_common_NA18507.annovar # filter common dbSNP.${dbsnpver} variants annotate_variation.pl \ --buildver hg19 \ --filter \ --dbtype snp${dbsnpver}Common \ ${infolder}/${infile} \ ${annovardb}
# 3] other databases present on the UCSC ftp server include 'Flagged' and 'Mult' subsets # corresponding to the disease variants and SNPS mapping to multiple locations (likely artefacts).
The run produces two files, one for variants that correspond to the filter database (dropped) and one for the variants absent from the databases (filtered). The files are named by the input name followed by indication of the build and database used as follows (<infile>.<build>_<snp${dbsnpver}>_dropped|filtered)
filtering results for dbSNP.138_Common
NOTICE: Variants matching filtering criteria are written to annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_dropped, other variants are written to annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_filtered NOTICE: Processing next batch with 69851 unique variants in 69851 input lines NOTICE: Database index loaded. Total number of bins is 2719867 and the number of bins to be scanned is 25890 NOTICE: Scanning filter database /opt/biotools/biodata/humandb/hg19_snp138Common.txt...Done # resulting records are as follows: wc -l annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_* 55634 annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_dropped 14217 annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_filtered 69851 total head -5 annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_dropped | column -t snp138Common rs28880928 chr21 10703925 10703925 T C hom 222 . snp138Common rs76134637 chr21 10703974 10703974 C A het 225 . snp138Common rs76601778 chr21 10707729 10707729 T G het 163 . snp138Common rs28844326 chr21 10707784 10707784 C T het 191 . snp138Common rs74784077 chr21 10707945 10707945 C T het 18.1 . head -5 annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_filtered | column -t chr21 32548924 32548924 - TAAACCAAG het 217 . chr21 34394243 34394243 C A het 214 . chr21 28674151 28674151 T - het 217 . chr21 9476303 9476303 G T hom 63 . chr21 37876293 37876294 GA - het 217 .
BEDtools to create a subset of the original VCF from a subset of annotated records
As seen above, converting VCF for Annovar use leads to data loss that may be good to have in downstream analysis. One way to subset the full VCF is to intersect it with a subset of Annovar annotated lines using Bedtools.
infolder=vcf-venn/gold-set outfolder=annovar-results mkdir -p ${outfolder} infile=chr21_common_NA18507.vcf.gz # isolate the nonsynonymous calls counted above to create a filter set # 187 nonsynonymous SNV grep "nonsynonymous" ${outfolder}/common_NA18507.exonic_variant_function.bed > ${outfolder}/nonsynonymous_subset.bed ## INTERSECT with BEDTOOLS # run bedtools intersect with specific options # -header replicates the vcf header in the output # -wa reports entries in 'a' = vcf file # -f 1 requires 100% reciprocal intersect (identity) bedtools intersect -wa -header -f 1 -a ${infolder}/${infile} \ -b ${outfolder}/nonsynonymous_subset.bed >\ ${outfolder}/nonsynonymous_subset.vcf # count results grep -v "^#" -c ${outfolder}/nonsynonymous_subset.vcf 187
It occures that different row counts are found when using differently formatted data. This is a consequence of the BED format being 0-based open and the VCF 1-based open and is not what we expect. When intersecting VCF data, it is preferable to use VCFtools rather than BED tools
VCFtools to create a subset of the original VCF from a subset of annotated records
infolder=vcf-venn/gold-set infile=chr21_common_NA18507.vcf outfolder=annovar-results mkdir -p ${outfolder} # isolate the nonsynonymous calls counted above to create a filter set grep -c "nonsynonymous" ${outfolder}/common_NA18507.exonic_variant_function.bed # 187 nonsynonymous SNV grep "nonsynonymous" ${outfolder}/common_NA18507.exonic_variant_function.bed > ${outfolder}/nonsynonymous_subset.bed ## INTERSECT with VCFTOOLS # decompress the vcf.gz file to the output folder # vcftools sometimes accepts .gz files ... with --gzvcf but this is here an issue gzip -cd ${infolder}/${infile}.gz > ${outfolder}/${infile} vcftools --vcf ${outfolder}/${infile} \ --out ${outfolder}/vcftools-nonsynonymous_subset \ --recode \ --bed ${outfolder}/nonsynonymous_subset.bed
VCFtools - v0.1.12a (C) Adam Auton and Anthony Marcketta 2009 Parameters as interpreted: --vcf annovar-results/chr21_common_NA18507.vcf --out annovar-results/vcftools-nonsynonymous_subset --recode --bed annovar-results/nonsynonymous_subset.bed After filtering, kept 1 out of 1 Individuals Outputting VCF file... Read 187 BED file entries. After filtering, kept 186 out of a possible 69845 Sites Run Time = 0.00 seconds
# count results grep -v "^#" -c ${outfolder}/vcftools-nonsynonymous_subset.recode.vcf 186
download exercise files
Download exercise files here
References:
- ↑ http://www.openbioinformatics.org/annovar/
- ↑
Kai Wang, Mingyao Li, Hakon Hakonarson
ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.
Nucleic Acids Res: 2010, 38(16);e164
[PubMed:20601685] ##WORLDCAT## [DOI] (I p) - ↑ http://www.ensembl.org/info/docs/variation/vep/index.html
- ↑ http://genome.ucsc.edu/cgi-bin/hgVai
- ↑ http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps
- ↑ http://snpeff.sourceforge.net
- ↑ http://varianttools.sourceforge.net/Annotation/HomePage
- ↑ http://vat.gersteinlab.org/download.php
- ↑ http://www.bioconductor.org/packages/2.12/bioc/html/VariantAnnotation.html
- ↑ http://www.openbioinformatics.org/annovar/annovar_db.html
- ↑ http://genome.ucsc.edu/cgi-bin/hgTables
- ↑ http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS Exercise.7_SnpEff | NGS_Exercise.7_vcfCodingSnps | NGS Exercise.8 ]