NGS Exercise.3e
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 | NGS Exercise.3 | NGS Exercise.3b | NGS Exercise.3c | NGS Exercise.3d ]
Align paired end reads to the human reference genome hg19 using RTG 3.6.1
# added in 2016
Contents
- 1 Why use the RTG software instead of the better known BWA, Bowtie2, or STAR?
- 2 Build the RTG index files from the HiSeq_UCSC_hg19.fa reference
- 3 Build the RTG index files from the paired fastq read files
- 4 Map reads to the reference
- 5 Call variants by comparing mappings to the reference genome sequence
- 6 download exercise files
Why use the RTG software instead of the better known BWA, Bowtie2, or STAR?
RTG (RealTime Genomics) is a less well known aligner suite that was introduced in the end of the years 2010 and has since then pretty much evolved into a very complete product. RTG can be used free of charge for academia and we strongly support its use (please read more and download it from http://www.realtimegenomics.com/[1])
RTG v3.6.1 commands
Usage: rtg COMMAND [OPTION]... rtg RTG_MEM=16G COMMAND [OPTION]... (e.g. to set maximum memory use to 16 GB) Type 'rtg help COMMAND' for help on a specific command. The following commands are available: Data formatting: format convert sequence data files to SDF cg2sdf convert Complete Genomics reads to SDF sdf2fasta convert SDF to FASTA sdf2fastq convert SDF to FASTQ sdf2sam convert SDF to SAM/BAM sdf2cg convert SDF formatted CG data to CG TSV Read mapping: map read mapping mapf read mapping for filtering purposes cgmap read mapping for Complete Genomics data Protein search: mapx translated protein search Assembly: assemble assemble reads into long sequences addpacbio add Pacific Biosciences reads to an assembly Variant detection: calibrate create calibration data from SAM/BAM files svprep prepare SAM/BAM files for structural variant analysis sv find structural variants discord detect structural variant breakends using discordant reads coverage calculate depth of coverage from SAM/BAM files snp call variants from SAM/BAM files family call variants for a family following Mendelian inheritance somatic call variants for a tumor/normal pair population call variants for multiple potentially-related individuals lineage call de novo variants in a cell lineage avrbuild build an AVR model from training examples avrpredict run AVR on a VCF file cnv call CNVs from paired SAM/BAM files Metagenomics: species estimate species frequency in metagenomic samples similarity calculate similarity matrix and nearest neighbor tree Pipelines: composition-meta-pipeline run metagenomic composition pipeline functional-meta-pipeline run metagenomic functional pipeline composition-functional-meta-pipeline run metagenomic composition and functional pipelines Simulation: genomesim generate simulated genome sequence cgsim generate simulated reads from a sequence readsim generate simulated reads from a sequence readsimeval evaluate accuracy of mapping simulated reads popsim generate a VCF containing simulated population variants samplesim generate a VCF containing a genotype simulated from a population childsim generate a VCF containing a genotype simulated as a child of two parents denovosim generate a VCF containing a derived genotype containing de novo variants samplereplay generate the genome corresponding to a sample genotype Utility: bgzip compress a file using block gzip index create a tabix index extract extract data from a tabix indexed file aview ASCII read mapping and variant viewer sdfstats print statistics about an SDF sdfsplit split an SDF into multiple parts sdfsubset extract a subset of an SDF into a new SDF sdfsubseq extract a subsequence from an SDF as text sam2bam convert SAM file to BAM file and create index sammerge merge sorted SAM/BAM files samstats print statistics about a SAM/BAM file samrename rename read id to read name in SAM/BAM files mapxrename rename read id to read name in mapx output files chrstats check expected chromosome coverage levels from mapping calibration files mendelian check a multisample VCF for Mendelian consistency vcfstats print statistics about variants contained within a VCF file vcfmerge merge single-sample VCF files into a single multi-sample VCF vcffilter filter records within a VCF file vcfannotate annotate variants within a VCF file vcfsubset create a VCF file containing a subset of the original columns vcfeval evaluate called variants for agreement with a baseline variant set pedfilter filter and convert a pedigree file pedstats print information about a pedigree file avrstats print statistics about an AVR model rocplot plot ROC curves from vcfeval ROC data files ncbi2tax create RTG taxonomy from NCBI taxdump files taxfilter create and manipulate RTG taxonomy files taxstats summarize and verify the taxonomy in an SDF usageserver run a local server for collecting RTG command usage information version print version and license information license print license information for all commands help print this screen or help for specified command
Build the RTG index files from the HiSeq_UCSC_hg19.fa reference
The command creates hash files from the fasta sequence of the reference genome. As always, you are warmly advised to read the documentation. A new ENV variable was created in .bashrc and exported that points to the default location for rtg indexes. This variable was named RTG_INDEXES and is used below.
rtg format command arguments
Usage: rtg format [OPTION]... -o SDF FILE+ [OPTION]... -o SDF -I FILE [OPTION]... -o SDF -l FILE -r FILE Converts the contents of sequence data files (FASTA/FASTQ/SAM/BAM) into the RTG Sequence Data File (SDF) format. File Input/Output -f, --format=FORMAT format of input (Must be one of [fasta, fastq, sam-se, sam-pe]) (Default is fasta) -I, --input-list-file=FILE file containing a list of input read files (1 per line) -l, --left=FILE left input file for FASTA/FASTQ paired end data -o, --output=SDF name of output SDF -p, --protein input is protein. If this option is not specified, then the input is assumed to consist of nucleotides -q, --quality-format=FORMAT format of quality data for fastq files (use sanger for Illumina 1.8+) (Must be one of [sanger, solexa, illumina]) -r, --right=FILE right input file for FASTA/FASTQ paired end data FILE+ input sequence files. May be specified 0 or more times Filtering --duster treat lower case residues as unknowns --exclude=STRING exclude input sequences based on their name. If the input sequence contains the specified string then that sequence is excluded from the SDF. May be specified 0 or more times --select-read-group=STRING when formatting from SAM/BAM input, only include reads with this read group ID --trim-threshold=INT trim read ends to maximise base quality above the given threshold Utility --allow-duplicate-names disable checking for duplicate sequence names -h, --help print help on command-line flag usage --no-names do not include name data in the SDF output --no-quality do not include quality data in the SDF output --sam-rg=STRING|FILE file containing a single valid read group SAM header line or a string in the form "@RG\tID:READGROUP1\tSM:BACT_SAMPLE\tPL:ILLUMINA"
rtg format -o HiSeq_UCSC_hg19 HiSeq_UCSC_hg19.fa
index building output
Formatting FASTA data Processing "HiSeq_UCSC_hg19.fa" Input Data Files : HiSeq_UCSC_hg19.fa Format : FASTA Type : DNA Number of sequences: 25 Total residues : 3095693983 Minimum length : 16571 Maximum length : 249250621 Output Data SDF-ID : 53b72eea-afb1-4f1d-8e98-4c916af3175d Number of sequences: 25 Total residues : 3095693983 Minimum length : 16571 Maximum length : 249250621
Build the RTG index files from the paired fastq read files
Now the turn for the reads which will also be converted to a hash. This is unusual with other programs which take fastq as input while RTG also encodes the reads for speed improvement.
rtg format \ -f fastq -q sanger \ -l reads/shuffled_PE_NA18507_GAIIx_100_chr21_2_1.fq.gz \ -r reads/shuffled_PE_NA18507_GAIIx_100_chr21_2_2.fq.gz \ --sam-rg="@RG\tID:NA18507\tSM:GAIIx-chr21-RTG_3.6.1\tPL:ILLUMINA" \ -o hg19_rtg-mapping/NA18507_reads
index building output
Formatting paired-end FASTQ data Processing left arm "reads/shuffled_PE_NA18507_GAIIx_100_chr21_2_1.fq.gz" Processing right arm "reads/shuffled_PE_NA18507_GAIIx_100_chr21_2_2.fq.gz" Too many sequences to check for duplicate sequence names. Input Data Files : shuffled_PE_NA18507_GAIIx_100_chr21_2_1.fq.gz shuffled_PE_NA18507_GAIIx_100_chr21_2_2.fq.gz Format : FASTQ Type : DNA Number of pairs : 7495097 Number of sequences: 14990194 Total residues : 1499019400 Minimum length : 100 Maximum length : 100 Output Data SDF-ID : 7f8c9245-f353-41cc-ab67-83ea559b3bce Number of pairs : 7495097 Number of sequences: 14990194 Total residues : 1499019400 Minimum length : 100 Maximum length : 100
Map reads to the reference
The mapping command can be tuned to fit specific needs, we use here default parameters which are often good enough for a diploid genome and the kind of application described in this training.
rtg map command arguments
Usage: rtg map [OPTION]... -o DIR -t SDF -i SDF|FILE [OPTION]... -o DIR -t SDF -l FILE -r FILE Aligns sequence reads onto a reference template, creating an alignments file in the Sequence Alignment/Map (SAM) format. File Input/Output -F, --format=FORMAT input format for reads (Must be one of [sdf, fasta, fastq, sam-se, sam-pe]) (Default is sdf) -i, --input=SDF|FILE input read set -l, --left=FILE left input file for FASTA/FASTQ paired end reads --no-merge output mated/unmated/unmapped alignments into separate SAM/BAM files -o, --output=DIR directory for output -q, --quality-format=FORMAT format of quality data for fastq files (use sanger for Illumina 1.8+) (Must be one of [sanger, solexa, illumina]) -r, --right=FILE right input file for FASTA/FASTQ paired end reads --sam output the alignment files in SAM format -t, --template=SDF SDF containing template to map against Sensitivity Tuning --aligner-band-width=FLOAT aligner indel band width scaling factor, fraction of read length allowed as an indel (Default is 0.5) --aligner-mode=STRING pick the aligner to use (Must be one of [auto, table, general]) (Default is auto) --bed-regions=FILE restrict calibration to mappings falling within the supplied BED regions --gap-extend-penalty=INT penalty for a gap extension during alignment (Default is 1) --gap-open-penalty=INT penalty for a gap open during alignment (Default is 19) -c, --indel-length=INT guaranteed number of positions that will be detected in a single indel (Default is 1) -b, --indels=INT guaranteed minimum number of indels which will be detected (Default is 1) -M, --max-fragment-size=INT maximum permitted fragment size when mating paired reads (Default is 1000) -m, --min-fragment-size=INT minimum permitted fragment size when mating paired reads (Default is 0) --mismatch-penalty=INT penalty for a mismatch during alignment (Default is 9) -d, --orientation=STRING orientation for proper pairs (Must be one of [fr, rf, tandem, any]) (Default is any) --pedigree=FILE genome relationships pedigree containing sex of sample --repeat-freq=INT maximum repeat frequency (Default is 90%) --sex=SEX sex of sample (Must be one of [male, female, either]) --soft-clip-distance=INT soft clip alignments if indels occur INT bp from either end (Default is 5) -s, --step=INT step size (Default is word size) -a, --substitutions=INT guaranteed minimum number of substitutions which will be detected (Default is 1) --unknowns-penalty=INT penalty for unknown nucleotides during alignment (Default is 5) -w, --word=INT word size (Default is 22, or read length / 2, whichever is smaller) Filtering --end-read=INT exclusive upper bound on read id --start-read=INT inclusive lower bound on read id Reporting --all-hits output all alignments meeting thresholds instead of applying mating and N limits --max-mated-mismatches=INT maximum mismatches for mappings across mated results, alias for --max-mismatches (as absolute value or percentage of read length) (Default is 10%) -e, --max-mismatches=INT maximum mismatches for mappings in single-end mode (as absolute value or percentage of read length) (Default is 10%) -n, --max-top-results=INT maximum number of top equal results output per read (Default is 5) -E, --max-unmated-mismatches=INT maximum mismatches for mappings of unmated results (as absolute value or percentage of read length) (Default is 10%) --no-unmapped do not output unmapped reads --no-unmated do not output unmated reads when in paired-end mode --sam-rg=STRING|FILE file containing a single valid read group SAM header line or a string in the form "@RG\tID:READGROUP1\tSM:BACT_SAMPLE\tPL:ILLUMINA" --top-random output a single random top hit per read Utility -h, --help print help on command-line flag usage --legacy-cigars use legacy cigars in output --no-calibration do not produce calibration files -Z, --no-gzip do not gzip the output --no-index do not produce indexes for output files --no-svprep do not perform structural variant processing --read-names use read name in output instead of read id (Uses more RAM) --tempdir=DIR directory used for temporary files (Defaults to output directory) -T, --threads=INT number of threads (Default is the number of available cores)
rtg map -i hg19_rtg-mapping/NA18507_reads -o hg19_rtg-mapping/mapping-results -t $RTG_INDEXES/HiSeq_UCSC_hg19
rtg map output
ARM MAPPINGS left right both 6877822 6877822 13755644 91.8% mated uniquely (NH = 1) 10938 10938 21876 0.1% mated ambiguously (NH > 1) 262904 164976 427880 2.9% unmated uniquely (NH = 1) 6167 5480 11647 0.1% unmated ambiguously (NH > 1) 0 0 0 0.0% unmapped due to read frequency (XC = B) 46318 51570 97888 0.7% unmapped with no matings but too many hits (XC = C) 28596 70149 98745 0.7% unmapped with poor matings (XC = d) 129 141 270 0.0% unmapped with too many matings (XC = e) 31316 33709 65025 0.4% unmapped with no matings and poor hits (XC = D) 0 0 0 0.0% unmapped with no matings and too many good hits (XC = E) 230907 280312 511219 3.4% unmapped with no hits 7495097 7495097 14990194 100.0% total
RTG mapping plots] | |
---|---|
Alignment-Score-Distribution-(Mated-Only) | |
Alignment-Score-Distribution-(Unmated-Only) | |
Fragment-Length-Distribution | |
Mapping-Quality-(MAPQ)-Distribution |
Call variants by comparing mappings to the reference genome sequence
The many RTG applications comprise:
- calling SNV variants with rtg snp
- preparing structural variant analysis with rtg svprep
- calling structural variants with rtg sv
- finding chromosomal breakpoints with rtg discord
- analysis of coverage with rtg coverage
- call CNV with rtg cnv
Other type of analysis are present in RTG and we strongly suggest to look at them, they include somatic, family, population, and many more...
We will only apply here the first tool to obtain calls for snps and short indels. We apply default parameters to that run too. Please look at the many possible additional arguments below.
rtg snp command arguments
Usage: rtg snp [OPTION]... -o DIR -t SDF FILE+ [OPTION]... -o DIR -t SDF -I FILE Calls sequence variants, such as single nucleotide polymorphisms (SNPs), multi-nucleotide polymorphisms (MNPs) and Indels, from a set of alignments reported in co-ordinate sorted SAM/BAM files. File Input/Output -I, --input-list-file=FILE file containing a list of SAM/BAM format files (1 per line) containing mapped reads --no-calibration if set, ignore mapping calibration files -o, --output=DIR directory for output -t, --template=SDF SDF of the reference genome the reads have been mapped against FILE+ SAM/BAM format files containing mapped reads. May be specified 0 or more times Sensitivity Tuning --bed-regions=FILE BED file containing regions to process --exclude-mated exclude all mated SAM records --exclude-unmated exclude all unmated SAM records --keep-duplicates don't detect and filter duplicate reads based on mapping position -m, --machine-errors=STRING if set, force sequencer machine settings. One of [default, illumina, ls454_se, ls454_pe, complete, iontorrent] --max-as-mated=INT if set, ignore mated SAM records with an alignment score (AS attribute) that exceeds this value --max-as-unmated=INT if set, ignore unmated SAM records with an alignment score (AS attribute) that exceeds this value --max-coverage=INT skip calling in sites with combined depth exceeding this fixed value (Default is 200) --max-coverage-multiplier=FLOAT skip calling in sites with combined depth exceeding multiplier * average combined coverage determined from calibration (Default is 5.0) --max-hits=INT if set, ignore SAM records with an alignment count that exceeds this value --min-mapq=INT if set, ignore SAM records with MAPQ less than this value -p, --pedigree=FILE genome relationships PED file containing sex of individual --ploidy=STRING ploidy to use when the template does not contain a reference text file (Must be one of [diploid, haploid]) (Default is diploid) --population-priors=FILE if set, use the VCF file to generate population based site-specific priors --rdefault-mated=INT for mated reads that have no mapping quality supplied use this as the default quality (in Phred format from 0 to 63) (Default is 20) --rdefault-unmated=INT for unmated reads that have no mapping quality supplied use this as the default quality (in Phred format from 0 to 63) (Default is 20) --region=STRING if set, only process SAM records within the specified range. The format is one of <template_name>, <template_name>:start-end or <template_name>:start+length --sex=SEX sex of individual (Must be one of [male, female, either]) (Default is either) Reporting -a, --all write variant calls covering every position irrespective of thresholds --avr-model=FILE name of AVR model to use when scoring variants (Must be one of [none, alternate.avr, illumina-exome.avr, illumina-wgs.avr] or a path to a model file) (Default is illumina-exome.avr) --filter-ambiguity=INT threshold for ambiguity filter applied to output variants --filter-bed=FILE apply a position based filter, retaining only variants that fall in these BED regions --filter-depth=INT apply a fixed depth of coverage filter to output variants --filter-depth-multiplier=FLOAT apply a ratio based depth filter. The filter will be multiplier * average coverage determined from calibration files --min-avr-score=FLOAT if set, fail variants with AVR scores below this value --snps-only if set, will output simple SNPs only Utility -h, --help print help on command-line flag usage -Z, --no-gzip do not gzip the output --no-index do not produce indexes for output files -T, --threads=INT number of threads (Default is the number of available cores)
rtg snp -t $RTG_INDEXES/HiSeq_UCSC_hg19 -o hg19_rtg-mapping/snp-results hg19_rtg-mapping/mapping-results/alignments.bam
rtg snp output
Failed Filters : 78 Excessive Coverage : 1651 No Hypotheses : 59 Passed Filters : 106685 Complex Called : 27697 SNPs : 80247 MNPs : 1816 Insertions : 6111 Deletions : 6884 Indels : 2607 Same as reference : 9020 SNP Transitions/Transversions: 1.91 (72552/38028) Total Het/Hom ratio : 1.74 (62083/35582) SNP Het/Hom ratio : 1.66 (50034/30213) MNP Het/Hom ratio : 0.76 (782/1034) Insertion Het/Hom ratio : 2.07 (4118/1993) Deletion Het/Hom ratio : 2.92 (5128/1756) Indel Het/Hom ratio : 3.45 (2021/586) Insertion/Deletion ratio : 0.89 (6111/6884) Indel/SNP+MNP ratio : 0.19 (15602/82063)
In addition, a region file is provided in BED format that details the full genome in regions of reference calls, no calls and variant calls which are quite unique and very valuable.
rtg region details in the first part of chr21
chr21 9464400 9464408 complex-no-variant chr21 9465822 9465832 complex-no-variant chr21 9467415 9467417 complex-called chr21 9467596 9467605 complex-called chr21 9473185 9473193 complex-called chr21 9475390 9475398 complex-called chr21 9475843 9475846 complex-no-variant chr21 9476539 9476548 complex-called chr21 9476619 9476621 complex-called chr21 9477001 9477016 complex-called chr21 9477046 9477053 complex-called chr21 9477344 9477349 complex-called chr21 9477442 9477449 complex-called chr21 9477635 9477645 complex-called chr21 9477660 9477669 complex-called chr21 9477953 9477955 complex-called chr21 9478023 9478027 complex-called chr21 9478287 9478289 complex-called chr21 9478357 9478358 complex-no-variant chr21 9478562 9478565 complex-called chr21 9478714 9478717 complex-called ... ## the data is for chr21 divided into: 19476 complex-called 57 complex-no-hypotheses 1675 complex-no-variant 5 complex-over-coverage
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 | NGS Exercise.3 | NGS Exercise.3b | NGS Exercise.3c | NGS Exercise.3d ]