Hands-on introduction to NGS variant analysis
Dates: May 23 & 26, 2014; November 29, 2013 & December 17, 2013
[ Main_Page | NGS_data_analysis ]
A shorter, rewritten version of this original two-day training is available as a separate page: Hands-on introduction to NGS variant analysis-2016
The most recent version of this training (2018) can be found at Hands-on_introduction_to_NGS_variant_analysis-2018
Hands-on introduction to NGS variant analysis-laptop-file-list
Contents
Aims of the NGS DNA variant analysis two-day session
Using a full publicly available chromosome read-set from one of the 1000 genomes[1] samples:
- Perform a complete analysis workflow including QC, read mapping, read coverage analysis, and variant calling against the human reference genome.
- Use command line and popular open source software to evaluate each step of a classical NGS variant workflow and feel the complexity of the task.
- Compare the obtained results with gold standard public data.
- Identify bottlenecks and propose possible downstream analyses.
- Become familiar with bash scripting and practice some elementary rules for writing easy-to-recycle scripts (more on bash scripting here; a minimal sketch follows at the end of this section)
The skills acquired during this session should allow participants to reproduce the workflow with their own reads on a reasonably sized personal computer running Unix.
All scripts, as well as the most important result files, are available online to allow offline practice by trainees and web visitors.
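As an appetizer for those scripting habits, here is a minimal, recyclable bash sketch (the file names and defaults are hypothetical placeholders, not training files):
#!/usr/bin/env bash
# fail early on errors, unset variables, and broken pipes
set -e -u -o pipefail
# keep tunable values at the top so the script is easy to reuse
infile=${1:-"chr21_reads_1.fq.gz"}
outdir=${2:-"results"}
mkdir -p "$outdir"
echo "processing ${infile} into ${outdir}/"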
Not covered during this training
- This is not a Galaxy[2] training. The popular open-source graphical web platform can perform most steps of this training in an interactive and recordable manner, but that is not today's topic. For more information about Galaxy, please refer to the official website and follow our training pages. An experimental BITS Galaxy mirror can also be found (when active) at http://galaxy.bits.vib.be[3].
- Unfortunately, this is not a GATK[4] training either. Although excellent and widely used by the NGS community, the Broad Genome Analysis Toolkit (GATK) is referred to several times in this documentation but is not used by the participants due to license limitations. Academic users can download the GATK source after registering on the site with their university credentials. For more information about GATK, please refer to the official GATK pages.
More BITS Training Info
- On the VIB website: http://www.vib.be/en/training/research-training/courses/Pages/Hands-On-introduction-to-NGS-variant-analysis.aspx
- On the BITS website: https://www.bits.vib.be/index.php/training/201-ngs-variant-analysis
Summary
This training gives an introduction to the use of popular NGS analysis software packages at the command line. It reviews several interchangeable tools and provides hints to evaluate the quality and content of Genome-Seq data. The code used in the training pages can easily be recycled and adapted, and can form the basis of your own workflows.
The sequencing data used in this session were obtained from gDNA extracted from EBV-transformed B-lymphocytes of a healthy Nigerian individual (NA18507). More information about this sample is available from the Coriell repository, from which 1000 Genomes gDNA can be obtained [5].
This training does not cover all currently available methods. It does not aim to bring users to the level of a professional NGS analyst, but it provides enough information to let motivated biologists understand what DNA sequencing practically is, read and reuse command-line code, and, when necessary, communicate knowledgeably with NGS experts for more in-depth needs.
Prerequisites
Skills required to follow this training:
- basic Linux command-line skills are absolutely required to understand the code used throughout the session
- basic knowledge of human genome structure and nomenclature is necessary to appreciate the training tasks
- basic knowledge of Illumina NGS read structure is also required for the same reason
- Inspiring NGS variant analysis custom functions and code that you can use and reuse
People with no Linux command-line experience should first follow the 'Linux for bioinformatics' training ([6]); otherwise they will slow down the whole class. Attendees who lack knowledge about Illumina reads should first follow the preliminary one-day 'Introduction to the analysis of NGS data' training ([7]).
Complementary readings
In addition to today's menu, read these pages to refresh your memory and learn about important formats and conventions used during this training.
- Introduction to the NGS formats used in classical NGS applications and in today's hands-on
- Practical remarks about NGS variant analysis: laptop configuration and files presented in the sessions
- A survey of tools for variant analysis of next-generation genome sequencing data, published early 2013[8] and still very relevant.
Training Material
Most of the content produced during today's session has been added to the BITS platform (see links at the bottom of each exercise page).
Additional stored files relate to:
- Documentation on the BITS server, including a PDF handout of the 2013 version of this training (NGS-DNASeq_chr21-session.pdf)
- Reference files obtained from public sources and relating to the same NA18507 genome.
Today's main tools
In addition to the many built-in Unix commands, originating for a good part from the GNU consortium (please read the GNU coreutils manual for more [9]), we will use several dedicated toolboxes today, including:
• bwa, the Burrows-Wheeler Aligner, for mapping reads to the reference genome
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'. There are
three alignment algorithms in BWA: `mem', `bwasw' and `aln/samse/sampe'. If
you are not sure which to use, try `bwa mem' first. Please `man ./bwa.1' for
the manual.
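A typical two-step bwa invocation looks as follows (a hedged sketch; hg19.fa and the read file names are placeholders for the actual training files):
# index the reference once (slow, but done a single time)
bwa index hg19.fa
# map a read pair with bwa mem; -M marks shorter split hits as secondary (Picard-friendly)
bwa mem -M -t 4 hg19.fa reads_1.fq.gz reads_2.fq.gz > NA18507_pe.sam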
• bamUtil, tools to edit SAM/BAM data
Set of tools for operating on SAM/BAM files.
Version: 1.0.9; Built: Thu Oct 31 11:02:01 CET 2013 by root
Tools to Rewrite SAM/BAM Files:
convert - Convert SAM/BAM to SAM/BAM
writeRegion - Write a file with reads in the specified region and/or have the specified read name
splitChromosome - Split BAM by Chromosome
splitBam - Split a BAM file into multiple BAM files based on ReadGroup
findCigars - Output just the reads that contain any of the specified CIGAR operations.
Tools to Modify & write SAM/BAM Files:
clipOverlap - Clip overlapping read pairs in a SAM/BAM File already sorted by Coordinate or ReadName
filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high
revert - Revert SAM/BAM replacing the specified fields with their previous values (if known) and removes specified tags
squeeze - reduces files size by dropping OQ fields, duplicates, & specified tags, using '=' when a base matches the reference, binning quality scores, and replacing readNames with unique integers
trimBam - Trim the ends of reads in a SAM/BAM file changing read ends to 'N' and quality to '!'
mergeBam - merge multiple BAMs and headers appending ReadGroupIDs if necessary
polishBam - adds/updates header lines & adds the RG tag to each record
dedup - Mark Duplicates
recab - Recalibrate
Informational Tools:
validate - Validate a SAM/BAM File
diff - Diff 2 coordinate sorted SAM/BAM files.
stats - Stats a SAM/BAM File
gapInfo - Print information on the gap between read pairs in a SAM/BAM File.
Tools to Print Information In Readable Format:
dumpHeader - Print SAM/BAM Header
dumpRefInfo - Print SAM/BAM Reference Name Information
dumpIndex - Print BAM Index File in English
readReference - Print the reference string for the specified region
explainFlags - Describe flags
Additional Tools:
bam2FastQ - Convert the specified BAM file to fastQs.
Dummy/Example Tools:
readReference - Print the reference string for the specified region
Usage: bam <tool> [<tool arguments>]
The usage for each tool is described by specifying the tool with no arguments.
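For instance, one can quickly check a BAM file for format problems and basic statistics (a sketch; the file name is a placeholder):
# look for formatting errors in a BAM file
bam validate --in NA18507_pe.bam --verbose
# print basic statistics for the same file
bam stats --in NA18507_pe.bam --basic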
• samtools, the mother of all NGS tools, dealing with the SAM/BAM formats and with variant calling (via bcftools)
Version: 0.1.19+
Usage: samtools <command> [options]
Command: view SAM<->BAM conversion
sort sort alignment file
mpileup multi-way pileup
depth compute the depth
faidx index/extract FASTA
tview text alignment viewer
index index alignment
idxstats BAM index stats (r595 or later)
fixmate fix mate information
flagstat simple stats
calmd recalculate MD/NM tags and '=' bases
merge merge sorted alignments
rmdup remove PCR duplicates
reheader replace BAM header
cat concatenate BAMs
bedcov read depth per BED region
targetcut cut fosmid regions (for fosmid pool only)
phase phase heterozygotes
bamshuf shuffle and group alignments by name
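The classical post-mapping cleanup with this samtools version reads like this (a sketch with placeholder file names; note that 0.1.19 `sort' takes an output prefix, not a file name):
# SAM to BAM (-S tells this samtools version the input is SAM)
samtools view -bS NA18507_pe.sam > NA18507_pe.bam
# coordinate-sort and index for fast random access
samtools sort NA18507_pe.bam NA18507_pe_sorted
samtools index NA18507_pe_sorted.bam
# quick mapping statistics
samtools flagstat NA18507_pe_sorted.bam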
• bcftools (samtools), the classical variant caller used on samtools pileup results
Version: 0.1.19-44428cd
Usage: bcftools <command> <arguments>
Command: view print, extract, convert and call SNPs from BCF
index index BCF
cat concatenate BCFs
ld compute all-pair r^2
ldpair compute r^2 between requested pairs
• vcfutils.pl (samtools), the VCF toolbox used to filter bcftools results
Command: subsam get a subset of samples
listsam list the samples
fillac fill the allele count field
qstats SNP stats stratified by QUAL
hapmap2vcf convert the hapmap format to VCF
ucscsnp2vcf convert UCSC SNP SQL dump to VCF
varFilter filtering short variants (*)
vcf2fq VCF->fastq (**)
Notes: Commands with description ending with (*) may need bcftools
specific annotations.
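Put together, the classical samtools-0.1.19 calling pipeline chains the three tools above (a sketch with placeholder file names; -D caps the read depth considered by varFilter):
# pileup -> raw BCF -> filtered VCF
samtools mpileup -uf hg19.fa NA18507_pe_sorted.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf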
• Picard (2.x) [10], the 'broadly' used Java wrapper for samtools with added functionality
# provided PICARD is defined and points to the place where the picard.jar file is located, you can type:
java -jar $PICARD/picard.jar <command-name> <command-options...>
## current sorted list of Picard commands:
AddCommentsToBam.jar, AddOrReplaceReadGroups.jar, BamIndexStats.jar, BamToBfq.jar, BuildBamIndex.jar,
CalculateHsMetrics.jar, CheckIlluminaDirectory.jar, CleanSam.jar, CollectAlignmentSummaryMetrics.jar,
CollectGcBiasMetrics.jar, CollectInsertSizeMetrics.jar, CollectMultipleMetrics.jar, CollectRnaSeqMetrics.jar,
CollectTargetedPcrMetrics.jar, CollectWgsMetrics.jar, CompareSAMs.jar, CreateSequenceDictionary.jar,
DownsampleSam.jar, EstimateLibraryComplexity.jar, ExtractIlluminaBarcodes.jar, ExtractSequences.jar,
FastqToSam.jar, FilterSamReads.jar, FixMateInformation.jar, GatherBamFiles.jar, IlluminaBasecallsToFastq.jar,
IlluminaBasecallsToSam.jar, IntervalListTools.jar, MakeSitesOnlyVcf.jar, MarkDuplicates.jar, MarkIlluminaAdapters.jar,
MeanQualityByCycle.jar, MergeBamAlignment.jar, MergeSamFiles.jar, MergeVcfs.jar, NormalizeFasta.jar,
QualityScoreDistribution.jar, ReorderSam.jar, ReplaceSamHeader.jar, RevertOriginalBaseQualitiesAndAddMateCigar.jar,
RevertSam.jar, SamFormatConverter.jar, SamToFastq.jar, SortSam.jar, SplitVcfs.jar, ValidateSamFile.jar,
VcfFormatConverter.jar, ViewSam.jar
## get help in Picard by typing
java -jar $PICARD/picard.jar <command> -h
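For example, duplicate marking, a standard step before variant calling, would look like this (a sketch assuming $PICARD is set as above and the input BAM is coordinate-sorted):
java -jar $PICARD/picard.jar MarkDuplicates \
  INPUT=NA18507_pe_sorted.bam \
  OUTPUT=NA18507_pe_mdup.bam \
  METRICS_FILE=NA18507_mdup_metrics.txt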
• samtools (htslib), the developer version of samtools
Version: 0.2.0-rc7-99-gd60458f (using htslib 0.2.0-rc7-37-gdb8055a)
Usage: samtools <command> [options]
Commands:
-- indexing
faidx index/extract FASTA
index index alignment
-- editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
rmdup remove PCR duplicates
targetcut cut fosmid regions (for fosmid pool only)
-- file operations
bamshuf shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
-- stats
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
• bcftools (htslib), the samtools (dev) accompanying toolbox
Version: 0.2.0-rc7-52-ge3fe732 (using htslib 0.2.0-rc7-37-gdb8055a)
Usage: bcftools <command> <argument>
Commands:
-- Indexing
index index VCF/BCF files
-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
isec intersections of VCF/BCF files
merge merge VCF/BCF files from non-overlapping sample sets
norm left-align and normalize indels
query transform VCF/BCF into user-defined formats
view VCF/BCF conversion, view, subset and filter VCF/BCF files
-- VCF/BCF analysis
call SNP/indel calling
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats
Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.
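With the htslib versions, the equivalent calling pipeline becomes shorter (a sketch; -m selects the newer multiallelic caller and -O z writes bgzipped VCF):
# pileup with genotype likelihoods (-g), call variants only (-v)
samtools mpileup -ugf hg19.fa NA18507_pe_sorted.bam | bcftools call -vm -O z -o var_htslib.vcf.gz
# index the result for fast region queries
bcftools index var_htslib.vcf.gz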
• bedtools, a multi-purpose BED/SAM/VCF manipulation suite
Read it online at http://bedtools.readthedocs.org/ [12]
usage: bedtools <subcommand> [options]
The bedtools sub-commands include:
[ Genome arithmetic ]
intersect Find overlapping intervals in various ways.
window Find overlapping intervals within a window around an interval.
closest Find the closest, potentially non-overlapping interval.
coverage Compute the coverage over defined intervals.
map Apply a function to a column for each overlapping interval.
genomecov Compute the coverage over an entire genome.
merge Combine overlapping/nearby intervals into a single interval.
cluster Cluster (but don't merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
subtract Remove intervals based on overlaps b/w two files.
slop Adjust the size of intervals.
flank Create new intervals from the flanks of existing intervals.
sort Order the intervals in a file.
random Generate random intervals in a genome.
shuffle Randomly redistribute intervals in a genome.
annotate Annotate coverage of features from multiple files.
[ Multi-way file comparisons ]
multiinter Identifies common intervals among multiple interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.
[ Paired-end manipulation ]
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.
[ Format conversion ]
bamtobed Convert BAM alignments to BED (& other) formats.
bedtobam Convert intervals to BAM records.
bamtofastq Convert BAM records to FASTQ records.
bedpetobam Convert BEDPE intervals to BAM records.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
[ Fasta manipulation ]
getfasta Use intervals to extract sequences from a FASTA file.
maskfasta Use intervals to mask sequences from a FASTA file.
nuc Profile the nucleotide content of intervals in a FASTA file.
[ BAM focused tools ]
multicov Counts coverage from multiple BAMs at specific intervals.
tag Tag BAM alignments based on overlaps with interval files.
[ Statistical relationships ]
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
reldist Calculate the distribution of relative distances b/w two files.
[ Miscellaneous tools ]
overlap Computes the amount of overlap from two intervals.
igv Create an IGV snapshot batch script.
links Create a HTML page of links to UCSC locations.
makewindows Make interval "windows" across a genome.
groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
expand Replicate lines based on lists of values in columns.
[ General help ]
--help Print this help menu.
--version What version of bedtools are you using?.
--contact Feature requests, bugs, mailing lists, etc.
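Two typical one-liners from this suite (a sketch; the BED, VCF and BAM names are placeholders):
# keep only the variants that fall inside known exons (-u reports each variant once)
bedtools intersect -a var.flt.vcf -b exons.bed -u > var_in_exons.vcf
# per-base genome coverage as BEDGRAPH, including zero-coverage regions (-bga)
bedtools genomecov -ibam NA18507_pe_sorted.bam -bga > coverage.bedgraph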
• IGVtools, the toolbox used to prepare files for IGV (compress, index, and more)
Usage: igvtools [command] [options] [input file/dir] [other arguments]
Command: version print the version number
sort sort an alignment file by start position.
index index an alignment file
toTDF convert an input file (cn, gct, wig) to tiled data format (tdf)
count compute coverage density for an alignment file
formatexp center, scale, and log2 normalize an expression file
gui Start the gui
help <command> display this help message, or help on a specific command
See http://www.broadinstitute.org/software/igv/igvtools_commandline for more detailed help
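Typical preparation steps for IGV look like this (a sketch with placeholder names; `hg19' refers to an IGV genome identifier):
# index a VCF so IGV can load it on the fly
igvtools index var.flt.vcf
# pre-compute a coverage track (.tdf) from a sorted BAM
igvtools count -z 5 -w 25 NA18507_pe_sorted.bam NA18507_coverage.tdf hg19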
• annotate_variation.pl, the most-used Annovar command (but not the only one!)
Usage: annotate_variation.pl [arguments] <query-file|table-name> <database-location>
Optional arguments:
-h, --help print help message
-m, --man print complete documentation
-v, --verbose use verbose output
Arguments to download databases or perform annotations:
--downdb download annotation database
--geneanno annotate variants by gene-based annotation (infer functional consequence on genes)
--regionanno annotate variants by region-based annotation (find overlapped regions in database)
--filter annotate variants by filter-based annotation (find identical variants in database)
Arguments to control input and output:
--outfile <file> output file prefix
--webfrom <string> specify the source of database (ucsc or annovar or URL) (downdb operation)
--dbtype <string> specify database type
--buildver <string> specify genome build version (default: hg18 for human)
--time print out local time during program run
--comment print out comment line (those starting with #) in output files
--exonsort sort the exon number in output line (gene-based annotation)
--transcript_function use transcript name rather than gene name (gene-based annotation)
--hgvs use HGVS format for exonic annotation (c.122C>T rather than c.C122T) (gene-based annotation)
--separate separately print out all functions of a variant in several lines (gene-based annotation)
--seq_padding create a new file with cDNA sequence padded by this much either side (gene-based annotation)
--(no)firstcodondel treat first codon deletion as wholegene deletion (default: ON) (gene-based annotation)
--aamatrix <file> specify an amino acid substitution matrix file (gene-based annotation)
--colsWanted <string> specify which columns to output by comma-delimited numbers (region-based annotation)
--scorecolumn <int> the column with scores in DB file (region-based annotation)
--gff3dbfile <file> specify a DB file in GFF3 format (region-based annotation)
--bedfile <file> specify a DB file in BED format file (region-based annotation)
--gff3attribute output all fields in GFF3 attribute (default: ID and score only)
--genericdbfile <file> specify a DB file in generic format (filter-based annotation)
--vcfdbfile <file> specify a DB file in VCF format (filter-based annotation)
--otherinfo print out additional columns in database file (filter-based annotation)
--infoasscore use INFO field in VCF file as score in output (filter-based annotation)
Arguments to fine-tune the annotation procedure:
--batchsize <int> batch size for processing variants per batch (default: 5m)
--genomebinsize <int> bin size to speed up search (default: 100k for -geneanno, 10k for -regionanno)
--expandbin <int> check nearby bin to find neighboring genes (default: 2m/genomebinsize)
--neargene <int> distance threshold to define upstream/downstream of a gene
--exonicsplicing report exonic variants near exon/intron boundary as 'exonic;splicing' variants
--score_threshold <float> minimum score of DB regions to use in annotation
--reverse reverse directionality to compare to score_threshold
--normscore_threshold <float> minimum normalized score of DB regions to use in annotation
--rawscore output includes the raw score (not normalized score) in UCSC Browser Track
--minqueryfrac <float> minimum percentage of query overlap to define match to DB (default: 0)
--splicing_threshold <int> distance between splicing variants and exon/intron boundary (default: 2)
--indel_splicing_threshold <int> if set, use this value for allowed indel size for splicing variants (default: --splicing_threshold)
--maf_threshold <float> filter 1000G variants with MAF above this threshold (default: 0)
--sift_threshold <float> SIFT threshold for deleterious prediction for -dbtype avsift (default: 0.05)
--precedence <string> comma-delimited to specify precedence of variant function (default: exonic>intronic...)
--indexfilter_threshold <float> controls whether filter-based annotation use index if this fraction of bins need to be scanned (default: 0.9)
Arguments to control memory usage:
--memfree <int> ensure minimum amount of free system memory (default: 0)
--memtotal <int> limit total amount of memory used by ANNOVAR (default: 0, unlimited, in the order of kb)
--chromosome <string> examine these specific chromosomes in database file
Function: annotate a list of genetic variants against genome annotation databases stored at local disk.
Example:
#download gene annotation database (for hg18 build) and save to humandb/ directory
annotate_variation.pl -downdb refGene humandb/
annotate_variation.pl -buildver mm9 -downdb mousedb/
annotate_variation.pl -downdb -webfrom annovar esp6500si_all humandb/
#gene-based annotation of variants in the varlist file (by default --geneanno is ON)
annotate_variation.pl ex1.human humandb/
#region-based annotate variants
annotate_variation.pl -regionanno -dbtype cytoBand ex1.human humandb/
annotate_variation.pl -regionanno -dbtype gff3 -gff3dbfile tfbs.gff3 ex1.human humandb/
#filter rare or unreported variants (in 1000G/dbSNP) or predicted deleterious variants
annotate_variation.pl -filter -dbtype 1000g2012apr_all -maf 0.01 ex1.human humandb/
annotate_variation.pl -filter -dbtype snp130 ex1.human humandb/
annotate_variation.pl -filter -dbtype avsift ex1.human humandb/
Version: $LastChangedDate: 2013-08-23 11:32:41 -0700 (Fri, 23 Aug 2013) $
• snpEff commands used to add annotations to VCF data
snpEff version SnpEff 3.4 (build 2013-11-23), by Pablo Cingolani
Usage: snpEff [command] [options] [files]
Available commands:
[eff] : Calculate effect of variants. Default: eff (no command or 'eff').
build : Build a SnpEff database.
buildNextProt : Build a SnpEff for NextProt (using NextProt's XML files).
cds : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
closest : Annotate the closest genomic region.
count : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval.
databases : Show currently available databases (from local config file).
download : Download a SnpEff database.
dump : Dump to STDOUT a SnpEff database (mostly used for debugging).
genes2bed : Create a bed file from a genes list.
len : Calculate total genomic length for each marker type.
protein : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
spliceAnalysis : Perform an analysis of splice sites. Experimental feature.
Run 'java -jar snpEff.jar command' for help on each specific command
Generic options:
-c , -config : Specify config file
-d , -debug : Debug mode (very verbose).
-dataDir <path> : Override data_dir parameter from config file.
-h , -help : Show this help and exit
-noLog : Do not report usage statistics to server
-q , -quiet : Quiet mode (do not show any messages or errors)
-v , -verbose : Verbose mode
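A minimal annotation run could look like this (a sketch; the database name, memory setting and file names are assumptions to adapt to your setup):
# fetch the pre-built hg19 database once
java -jar snpEff.jar download hg19
# annotate a VCF; snpEff writes the annotated VCF to stdout
java -Xmx4g -jar snpEff.jar eff -v hg19 var.flt.vcf > var.flt.snpeff.vcf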
• snpSift commands used to filter snpEff-annotated VCF data
SnpSift version SnpSift 3.4 (build 2013-11-23), by Pablo Cingolani
Usage: java -jar SnpSift.jar [command] params...
Command is one of:
alleleMat : Create an allele matrix output.
annotate : Annotate 'ID' from a database (e.g. dbSnp). Assumes entries are sorted.
annMem : Annotate 'ID' from a database (e.g. dbSnp). Loads db in memory. Does not assume sorted entries.
caseControl : Compare how many variants are in 'case' and in 'control' groups; calculate p-values.
ccs : Case control summary. Case and control summaries by region, allele frequency and variant's functional effect.
coevolution : Co-evoution case control model (this feature is alpha).
concordance : Concordance metrics between two VCF files.
covMat : Create an covariance matrix output (allele matrix as input).
dbnsfp : Annotate with multiple entries from dbNSFP. <EXPERIMENTAL>
extractFields : Extract fields from VCF file into tab separated format.
filter : Filter using arbitrary expressions
geneSets : Annotate using MSigDb gene sets (MSigDb includes: GO, KEGG, Reactome, BioCarta, etc.)
gt : Add Genotype to INFO fields and remove genotype fields when possible.
gwasCat : Annotate using GWAS catalog
hwe : Calculate Hardy-Weimberg parameters and perform a godness of fit test.
intersect : Intersect intervals (genomic regions).
intervals : Keep variants that intersect with intervals.
intIdx : Keep variants that intersect with intervals. Index-based method: Used for large VCF file and a few intervals to retrieve
join : Join files by genomic region.
phastCons : Annotate using conservation scores (phastCons).
private : Annotate if a variant is private to a family or group.
rmRefGen : Remove reference genotypes.
rmInfo : Remove INFO fields.
sift : Annotate using SIFT scores from a VCF file.
split : Split VCF by chromosome.
tstv : Calculate transiton to transversion ratio.
varType : Annotate variant type (SNP,MNP,INS,DEL or MIXED).
vcf2tped : Convert VCF to TPED.
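A small filtering example (a sketch; the expression and file names are illustrative):
# keep only variants with quality >= 30 and read depth >= 10
cat var.flt.snpeff.vcf | java -jar SnpSift.jar filter "(QUAL >= 30) & (DP >= 10)" > var.hq.vcf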
The version number reported for each tool corresponds to the lowest version used to prepare this training and may differ from the current version. Older versions may work as well!
Additional tools were installed from downloads or from source and are referenced with web links in the respective pages.
Tools in the spotlight
Some tools found during the preparation of this training but not covered today include:
- seqtk, if you need to convert between FastA <-> FastQ (by Heng Li)
overview
Version: 1.0-r32
Command: seq common transformation of FASTA/Q
comp get the nucleotide composition of FASTA/Q
sample subsample sequences
subseq extract subsequences from FASTA/Q
trimfq trim FASTQ using the Phred algorithm
hety regional heterozygosity
mutfa point mutate FASTA at specified positions
mergefa merge two FASTA/Q files
randbase choose a random base from hets
cutN cut sequence at long N
listhet extract the position of each het
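Two everyday seqtk uses (a sketch; file names are placeholders):
# convert FASTQ to FASTA
seqtk seq -a reads_1.fq.gz > reads_1.fa
# subsample 10% of a read pair; the same seed (-s100) keeps both files in sync
seqtk sample -s100 reads_1.fq.gz 0.1 > sub_1.fq
seqtk sample -s100 reads_2.fq.gz 0.1 > sub_2.fq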
- bioawk, a modified AWK to dig into BED/VCF/SAM/... (by Heng Li)
overview
Examples
1. List the supported formats:
# bioawk -c help
bed: 1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
sam: 1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
vcf: 1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
gff: 1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
fastx: 1:name 2:seq 3:qual 4:comment
2. Extract unmapped reads without header:
bioawk -c sam 'and($flag,4)' aln.sam.gz
2b. Extract read pairs where both mates are unmapped, with header: potential novel sequences absent from the reference genome and/or unmappable reads due to sequence content
bioawk -Hc sam 'and($flag,4) && and($flag,8)' aln.sam
3. Extract mapped reads with header:
bioawk -Hc sam '!and($flag,4)'
4. Reverse complement FASTA:
bioawk -c fastx '{print ">"$name;print revcomp($seq)}' seq.fa.gz
5. Create FASTA from SAM (uses revcomp if FLAG & 16)
samtools view aln.bam | \
bioawk -c sam '{s=$seq; if(and($flag, 16)) {s=revcomp($seq)} print ">"$qname"\n"s}'
6. Print the genotypes of samples `foo` and `bar` from a VCF:
grep -v ^## in.vcf | bioawk -tc hdr '{print $foo,$bar}'
- BEDOPS, a serious challenger to bedtools that is certainly worth a try. Get the current PDF manual from https://media.readthedocs.org/pdf/bedops/latest/bedops.pdf[13]
Find more tools and answers
There are many tools out there; finding them is often the easiest part. You are welcome to try as many as you wish and to improve on the results obtained with our selected toolbox. When seeking advice, please consider using:
- SeqAnswers[14] for everything related to NGS
- BioStar[15] for questions about biocomputing and scripting for biologists
- stackoverflow[16] for questions related to coding.
- bioinformatics.ca directory[17] to find bioinformatics tools sorted by categories.
- The EBI online training offers many excellent training sessions with slides and exercises.
Today's Hands-On Training
Housekeeping rules
All exercises were performed by the trainer on one of the BITS laptops, and the results were moved to a new subfolder called Final_Results on the BITS laptops. This ensures that you can pick up any exercise and find the necessary input data in the corresponding folder, even if you did not perform the previous exercises.
The results that you generate will be stored in similarly named folders created directly under your working folder and will not overwrite the 'good' versions stored in Final_Results.
Additionally, key result files were copied to the BITS file server and can be downloaded from links at the bottom of each exercise page.
All scripts described during the training are stored in the scripts folder. These scripts, as well as all shorter command blocks presented in the exercises, expect that your terminal points to the training folder, known to the laptop as $BASE. You should always run your commands from the BASE camp; to make sure you are at the right address, type at any time: cd $BASE followed by Enter.
The Exercises
Code developed in these exercises
Most scripts included in the exercises below are reproduced and maintained on our dedicated GitHub page https://github.com/splaisan/NGS-Variant-Analysis-training [18] (please give us feedback about any issue with this page).
Exercises with a coloured text background are long and not required today. You can keep them for later.
- NGS Exercise.1: Download Illumina data, perform QC and correct SAM/BAM formatting errors
- NGS Exercise.2: Extract fastQ reads from BAM and perform basic read quality control
- NGS Exercise.3: Map paired-end reads to the human reference genome hg19 using the Burrows-Wheeler Aligner (BWA)
- NGS Exercise.3b: Perform QC on the BWA alignment data
- Added material: NGS Exercise.3c: Map paired-end reads to the human reference genome hg19 using Bowtie2
- Soon-to-be-added material: NGS Exercise.3d: Map paired-end reads to the human reference genome hg19 using STAR
- Soon-to-be-added material: NGS Exercise.3e: Map paired-end reads to the human reference genome hg19 using RTG
- NGS Exercise.4: Compute read coverage from the mapping results
- NGS Exercise.4b: Retrieve unsequenced regions of chr21 for Sanger re-sequencing
- NGS Exercise.5: Call variations by comparing the aligned reads with the reference sequence
- NGS Exercise.5_GATK: Call variations using the GATK (not covered today)
- NGS Exercise.6: Compare and intersect VCF files
- NGS Exercise.7: Add annotations to VCF files
- NGS_Exercise.7_annovar: after converting the data
- NGS_Exercise.7_vcfCodingSnps: inline, using the veteran app
- NGS_Exercise.7_SnpEff: using the emerging standard
- NGS Exercise.8: Visualize results in the Interactive Genome Viewer (IGV)
- NGS Exercise.9: Extract SOD1 locus specific data from large SAM/BAM, BED, or VCF data files
- Q&A added during the NGS variant analysis training
- Q&A added during the 2nd day NGS variant analysis training
Bonus material
- Call_variants_with_samtools_1.0 Call chr21 variants from newly mapped BAM data using samtools (htslib) and bcftools (htslib) OR varscan2 (added: 09-2014)
Conclusion
This hands-on training session aimed to present a simplified yet complete workflow for NGS variant analysis and to train users to take advantage of the many command-line resources.
The number of research groups generating NGS data is increasing over time, and the demand for data analysis is growing beyond the capacity of the facilities within the Institute, while outsourced commercial help remains too expensive. For these reasons, more and more students and staff scientists feel the 'urgent need' to perform their own NGS data analysis.
Although a certain level of computer infrastructure is required to perform this analysis under ideal conditions, knowledge and training in NGS analysis seem to be today's main limiting factors in research groups.
This session demonstrates that a minimum of training is sufficient to allow classical NGS analysis on a laptop computer. For higher throughput and more complex analyses, additional learning and more resources will be required, but a lot can be initiated using the many available online resources and communities.
Your feedback on this two-day, command-line based NGS data analysis session is very important to us and will be used to improve this content for later sessions. If you need more training of this kind, please contact us and we will organize additional hands-on sessions based on your requests. More advanced sessions will depend on the availability of expert users within VIB who agree to prepare the required material.
PS: We tried to cite resources and people that made this training possible. If you feel that some citations are missing, please inform us and we will correct this omission.
References:
- ↑ http://www.1000genomes.org
- ↑ http://main.g2.bx.psu.edu
- ↑ http://galaxy.bits.vib.be
- ↑ http://www.broadinstitute.org/gatk/
- ↑ http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM18507
- ↑ http://wiki.bits.vib.be/index.php/Introduction_to_Linux_for_bioinformatics
- ↑ http://wiki.bits.vib.be/index.php/NGS_data_analysis
- ↑ Stephan Pabinger, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, Zlatko Trajanoski. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform 2014, 15(2):256-78. [PubMed:23341494] [DOI]
- ↑ http://www.gnu.org/software/coreutils/manual/coreutils.pdf
- ↑ http://picard.sourceforge.net
- ↑ https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf
- ↑ http://bedtools.readthedocs.org/
- ↑ https://media.readthedocs.org/pdf/bedops/latest/bedops.pdf
- ↑ http://seqanswers.com SeqAnswers
- ↑ http://www.biostars.org BioStar
- ↑ http://stackoverflow.com stackoverflow
- ↑ http://bioinformatics.ca/links_directory
- ↑ https://github.com/splaisan/NGS-Variant-Analysis-training
[ Main_Page ]