NGS Exercise.8
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.7 | NGS Exercise.9 ]
# updated 2014 version
Visualize all results in the Interactive Genome Viewer (IGV)
Contents
Variant visualization tools
Due to the intrinsic noise in NGS data, Visualisation is vital in order to evaluate results and generate hypothesis prior to final validation. Visualisation also allows presenting the obtained data side by side with pieces of evidence originating from the zillion resources out-there.
What tool should you use?
Many popular visualization tools are available that present data in ways it can be better evaluated. We list below a coupe of open-source free products, many other free or commercial tools exist but listing them all is not the focus of this document.
- Circos is a very popular graphic package to display circular plots with annotations . Please see one simple Circos example NGS_Exercise.4
- A recent paper describes the Personal Genome Browser (PGB) designed to visualize human variants (http://www.pgbrowser.org/). As NA18507 is part of the available genomes, you can use this resource to confirm variants found during this session and add our VCF data to the browser. You may also want to read the accompanying paper [1] that lists many more tools that can be used in the same context.
- Other related tools are cited in the PGB publication cited above or in A survey of tools for variant analysis of next-generation genome sequencing data [2]
IGV: The Interactive Genome Viewer
Some users may prefer other gene browsers and are welcome to keep using them but we chose IGV as favorite genome viewer because it is well documented, already contains links to precious public datasets including Encode and UCSC tacks and because it is easy to deploy, fast and robust.
- IGV supports a large number of standard formats that can be found on their information pages (<http://www.broadinstitute.org/igv/FileFormats>).
- Loading data into IGV is simple and often makes use of the IGVTools; also available from the Broad institute. IGV tools are efficient when tiling or indexing is required to speed-up data parsing (large bed files). Other lighter formats can be opened as such without performance drop.
- IGV can also be used in combination with bedtools to export screen-shots in batch for loci listed in a bed file. To discover this command, please type *bedtools igv -h* (previously *bedToIgv*) in your terminal window for more information about this feature.
IGV will be used during the session and is not further documented here, please refer to the very extensive online help content on the 'omnipresent' Broad Institute site for more information. (<http://www.broadinstitute.org/igv>).
example: samtools view to fetch region-specific reads from the Internet
samtools view can read **directly from the internet** when the repository also contains the index file
This can be really interesting when you need reads from a small region of the genome and the BAM file on the internet is the full genome (many GBs).
The following code example extracts reads mapping to the DSCAM (your favorite gene?) region on chr21 directly from the WEB for the NA18507 genome sequenced by Complete Genomics. Only evidence reads are reported (those participating to Complete Genomics local realignment to find variants) which explains why reads are grouped in a region and not present all over the chromosome length. Regions without reads on the left and right of the screen did not contain putative calls and were not stored as evidenceSupport data (but the corresponding raw reads are present elsewhere in the CG data for computing coverage).
# add 1kb at either ends of the DSCAM locus: chr21:42,218,289-42,219,325 # result are stored to a small local file and indexed. outfolder="public_info/CG_variants/" baseurl="ftp://ftp.1000genomes.ebi.ac.uk" path="/vol1/ftp/data/NA18507/cg_data/" datafile="NA18507_buffy_SRR822930.mapped.COMPLETE_GENOMICS.CGworkflow2_2_evidenceSupport.YRI.high_coverage.20130401.bam" filter="chr21:42,218,289-42,219,325" samtools view -b -h ${baseurl}/${path}/${datafile} ${filter} > \ ${outfolder}/DSCAM_region.bam && samtools index \ ${outfolder}/DSCAM_region.bam
The BAM file can be visualised in **IGV** to show the sequencing read pileup and supporting evidence for variant calls (here zoomed on the exon1 of human DSCAM with T->C polymorphism at position 42'218'928 in blue: chr21:42,218,289-42,219,325).
IGV view of the first DSCAM exon (reverse orientation)
A region was created to point to DSCAM (> Regions/Region Navigator). You can also type DSCAM in the search box to find the locus
REM: Additional information about colours found in alignments can be obtained from the online IGV documentation.
- http://www.broadinstitute.org/igv/AlignmentData[3]
- http://www.broadinstitute.org/software/igv/interpreting_pair_orientations[4]
IGV color codes for reads
feature jumping and exon jumping
- To feature-jump, you select a feature track and press ⎈ Ctrl-F for forward, ⎈ Ctrl-B for back.
- To exon-jump, you select a feature track and press ⇧ Shift-⎈ Ctrl-F to center the next exon in your view, ⇧ Shift-⎈ Ctrl-B to move back one exon.
example: samtools faidx to fetch a region of the reference genome
samtools faidx should have been applied to the reference genome fasta file in order to use this functionality
This operation is often required to get the context of a variant and possibly align it to other sequences for control. The samtools faidx indexing of the reference genome allows very rapid retrieval of sequence using the same command with start and stop coordinates. An example below extracts 50bps either sides of the variant cited above.
IGV view of a hom variant in the first DSCAM exon
- a track for bcftools calls
- the track showing all reads mapped in the BWA mem experiment
- the read coverage depth derived by IGV from the BAM track and showing no coverage around the Start codon (while Complete Genomics reads did show significant coverage in the same window - see above)
# requires indexing the reference file (done) # => samtools faidx ref/HiSeq_UCSC_hg19.fa # the following command adds 50bps at either ends of the DSCAM variant position: chr21: 42218928 reference=ref/HiSeq_UCSC_hg19.fa target="chr21:42218878-42218978" samtools faidx ${reference} ${target} >chr21:42218878-42218978 TCGGCAGGTGGAGAGAGCCGCAGACGCGGGCGTGAGCTCATCCCGGGCACTCGGCGCCCA GAATGCGCAGCCAGCGGCCTCCCGTGCGCCAGCCCTCACCC
IGV view around a variant
- translation of the currently viewed 5'-UTR (not relevnant! => alway inspect the type of exon displayed and its orientation)
- the read coverage depth derived by IGV from the BAM track
- the variant metrics extracted from the BWA-mem VCF track (mouse over the red histogram bar in the vcf-track to get this popup window)
example: bedtools getfasta to fetch multiple regions of the reference genome
What if you need many regions, obtained from bedtools manipulations or other sources? bedtools getfasta will store the hg19 DNA sequence of each region defined above into a multifasta file
getfasta command details
Tool: bedtools getfasta (aka fastaFromBed) Version: v2.17.0-111-gab7bbbe Summary: Extract DNA sequences into a fasta file based on feature coordinates. Usage: bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta> Options: -fi Input FASTA file -bed BED/GFF/VCF file of ranges to extract from -fi -fo Output file (can be FASTA or TAB-delimited) -name Use the name field for the FASTA header -split given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons) -tab Write output in TAB delimited format. - Default is FASTA format. -s Force strandedness. If the feature occupies the antisense, strand, the sequence will be reverse complemented. - By default, strand information is ignored.
# download only coding exons for hg19 chr21 using the UCSC table browser in BED format infolder=ref infile=hg19_chr21-refGene-CDS.bed head -5 ${infolder}/${infile}
chr21 10906904 10907040 NM_199259_cds_0_0_chr21_10906905_r 0 - chr21 10908824 10908895 NM_199259_cds_1_0_chr21_10908825_r 0 - chr21 10910306 10910399 NM_199259_cds_2_0_chr21_10910307_r 0 - chr21 10914362 10914442 NM_199259_cds_3_0_chr21_10914363_r 0 - chr21 10916369 10916475 NM_199259_cds_4_0_chr21_10916370_r 0 -
# the result is a redundant bed where alternative gene models share common exons # sort the lines by chr+coord sort -k 1V,1 -k 2n,2 -k 3n,3 ${infolder}/${infile} > ${infolder}/srt_${infile} head -5 ${infolder}/srt_${infile}
chr21 10906904 10907040 NM_199259_cds_0_0_chr21_10906905_r 0 - chr21 10906904 10907040 NM_199260_cds_0_0_chr21_10906905_r 0 - chr21 10906904 10907040 NM_199261_cds_0_0_chr21_10906905_r 0 - chr21 10908824 10908895 NM_199259_cds_1_0_chr21_10908825_r 0 - chr21 10908824 10908895 NM_199260_cds_1_0_chr21_10908825_r 0 -
# merge overlapping exons and keep alternative names separated by '|' bedtools merge -nms -delim "|" -i ${infolder}/srt_${infile} > ${infolder}/merged_${infile} head -5 ${infolder}/merged_${infile} | sed -e "s/.\{60\}/&\n/g"
chr21 10906904 10907040 NM_199259_cds_0_0_chr21_10906905_r|NM_199260_cds_0_0_chr21_10906905_r|NM_199261_cds_0_0_chr21_10906905_r chr21 10908824 10908895 NM_199259_cds_1_0_chr21_10908825_r|NM_199260_cds_1_0_chr21_10908825_r|NM_199261_cds_1_0_chr21_10908825_r chr21 10910306 10910399 NM_199259_cds_2_0_chr21_10910307_r|NM_199260_cds_2_0_chr21_10910307_r|NM_199261_cds_2_0_chr21_10910307_r chr21 10914362 10914442 NM_199259_cds_3_0_chr21_10914363_r|NM_199260_cds_3_0_chr21_10914363_r|NM_199261_cds_3_0_chr21_10914363_r chr21 10916369 10916475 NM_199259_cds_4_0_chr21_10916370_r|NM_199260_cds_4_0_chr21_10916370_r|NM_199261_cds_4_0_chr21_10916370_r
# extract fasta for each line infasta=ref/HiSeq_UCSC_hg19.fa outfasta=ref/merged_${infile%%.bed}.fasta bedtools getfasta -fi ${infasta} -bed ${infolder}/merged_${infile} -fo ${outfasta} # inspect the first three sequences # !GOODY: the last part of the pipe breaks the sequence at 60 characters to display it properly head -6 ${outfasta} | sed -e "s/.\{60\}/&\n/g"
>chr21:10906904-10907040 TTAATCGGATCCAGCTACAACATCACTGGAAGTCATTTTCTCGCCAAAAAGTATCTCCAC GGCAAAATCTGATGGATAAATTCTCCGTGCTTTTTGTTTATGTAGATTATCCAATTCATT TTTTGGTAGATAAAGC >chr21:10908824-10908895 CTGTTATTTTCAATAAAAGATGTGTGCAACCAGAAGTAAAATGAGCAATTGTCATAGTAT GTAGGAAGATT >chr21:10910306-10910399 CGAATAGAAAAACTGCACTTTCACATCATCATACAGAGGTAGACCGTCGAATACATCAAT TAATATTTTGTCTGTTGTAATGTTATCAAGTAC
# OR extract fasta for each line and use name field for the fasta header (-name) infasta=ref/HiSeq_UCSC_hg19.fa outfasta=ref/merged_name_${infile%%.bed}.fasta bedtools getfasta -name -fi ${infasta} -bed ${infolder}/merged_${infile} -fo ${outfasta} # inspect the first three sequences # !GOODY: the last part of the pipe breaks the sequence at 60 characters to display it properly head -6 ${outfasta} | sed -e "s/.\{60\}/&\n/g"
>NM_199259_cds_0_0_chr21_10906905_r|NM_199260_cds_0_0_chr21 _10906905_r|NM_199261_cds_0_0_chr21_10906905_r TTAATCGGATCCAGCTACAACATCACTGGAAGTCATTTTCTCGCCAAAAAGTATCTCCA CGGCAAAATCTGATGGATAAATTCTCCGTGCTTTTTGTTTATGTAGATTATCCAATTCA TTTTTTGGTAGATAAAGC >NM_199259_cds_1_0_chr21_10908825_r|NM_199260_cds_1_0_chr21 _10908825_r|NM_199261_cds_1_0_chr21_10908825_r CTGTTATTTTCAATAAAAGATGTGTGCAACCAGAAGTAAAATGAGCAATTGTCATAGTA TGTAGGAAGATT >NM_199259_cds_2_0_chr21_10910307_r|NM_199260_cds_2_0_chr21 _10910307_r|NM_199261_cds_2_0_chr21_10910307_r CGAATAGAAAAACTGCACTTTCACATCATCATACAGAGGTAGACCGTCGAATACATCAA TTAATATTTTGTCTGTTGTAATGTTATCAAGTAC
example: add public data to explain and support private results
The example above with BWA mem shows a region where no reads mapped to the first exon of DSCAM and should be explainable using genome features in that region. Two common reasons for the absence of maping are the presence of repeated elements and a very high level of GC in a locus. To support these, two public tracks can be added from the built-in resources accessible through >File >Load from Server
adding public annotation stracks
result of adding public tracks
download exercise files
Download exercise files here
References:
- ↑
Liran Juan, Mingxiang Teng, Tianyi Zang, Yafeng Hao, Zhenxing Wang, Chengwu Yan, Yongzhuang Liu, Jie Li, Tianjiao Zhang, Yadong Wang
The personal genome browser: visualizing functions of genetic variants.
Nucleic Acids Res: 2014, 42(Web Server issue);W192-7
[PubMed:24799434] ##WORLDCAT## [DOI] (I p) - ↑
Stephan Pabinger, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, Zlatko Trajanoski
A survey of tools for variant analysis of next-generation genome sequencing data.
Brief Bioinform: 2014, 15(2);256-78
[PubMed:23341494] ##WORLDCAT## [DOI] (I p) - ↑ http://www.broadinstitute.org/igv/AlignmentData
- ↑ http://www.broadinstitute.org/software/igv/interpreting_pair_orientations
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.7 | NGS Exercise.9 ]