NGS-formats
Understanding and mastering NGS data formats is key to any analysis. You will spend some time learning how to do this and often will need Dr Google to find back how exactly you should access your data. The code below gives a general introduction to what you will encounter.
[ Main_Page | NGS_data_analysis | Hands-on introduction to NGS variant analysis |
| NGS Exercise.1 ]
Contents
The NGS format trinity
While fastQ is still the default format to store NGS reads, BAM became the default format to store read and mapping information after alignment of a read to a reference genome. Finally, VCF (variant call format) was recently developped during the 1000 genome project, and became de facto the standard variant manipulation format. (FastQ and VCF will be detailed later today)
CRAM is at the corner of the street! This 'new' / 'improved' BAM format may be adopted for its increased compactness. Read more about CRAM here
FASTQ format
In short, fastQ provides two lines of data and a header line (that starts with '@' and can be repeated or duplicated by a line containing starting by '+'. The header contains a unique identifier (name) and important information about the sequencer origin of the current read. A correct fastQ record is a 4-line block reads as the following. The Base call quality scores are represented by ASCCI characters. If you subtract 33 from the character ascii code, you obtain the exponent of confidence score of the call for a given nucleotide (Q=10^-<ascii-33>).
A pair of reads from the paired end library looks like this (note the /1 and /2 suffix)
GAATCCAACCCTCACAAAGAAGTTTCTCAGAATTCTTCCATCGAATTTTTATGTGATGGTATTTCCTTTTTTACCATAGGCCTCAAAGCGCTCCAAATAT
+
GGGGGGGGGGGGGFGGEGGBGGGGGGFGGGEFGFGGGGFFGGGFDGGGGGGGGEGGFGCEEGFFFFGGGGEGEEBDDEE@GGEGFBEEGEEEEEEB@CDD
@EAS51_0210:3:6:3797:7459/2
AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT
+
GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC
BAM format
The BAM format no more than a binary representation of the SAM format that is more compact and allows fast searching after indexing. BAM reports the address where the reads aligned best, followed by a score illustrating the qualit of mapping, the read sequence (as in the fastQ original file, and a text string of the same length as the sequence reporting the base-call quality for each nucleotide. Finally, the BAM format reports a so called CIGAR string that describes any mismatch between the read sequence and the reference genome at the alignment address. It is not necessary to read BAM fluently but it is strongly recommended to read the SAM format specifications from type to time to discover new features and refresh your memory.
The latest specification document is maintained by Heng Li (father of samtools, bwa, and other widely used tools) and can be found at:
- the samtools home page for the latest PDF version: http://samtools.sourceforge.net [1]
- or for the latest hts-specs documentation (needs building): https://github.com/samtools/hts-specs [2]
a typical SAM/BAM header
one @HD header line describing the SAM version @SQ sequence dictionary lines describing each chromosome in the reference genome in a chosen sorting order @PG line to describe software used to generate the read group
Here is the header from today's initial data that contains only three distinct header types which is far from optimal.
samtools view -H NA18507_GAIIx_100_chr21.bam
@PG ID:CASAVA VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16
@SQ SN:chr1 LN:247249719
@SQ SN:chr2 LN:242951149
@SQ SN:chr3 LN:199501827
@SQ SN:chr4 LN:191273063
@SQ SN:chr5 LN:180857866
@SQ SN:chr6 LN:170899992
@SQ SN:chr7 LN:158821424
@SQ SN:chrX LN:154913754
@SQ SN:chr8 LN:146274826
@SQ SN:chr9 LN:140273252
@SQ SN:chr10 LN:135374737
@SQ SN:chr11 LN:134452384
@SQ SN:chr12 LN:132349534
@SQ SN:chr13 LN:114142980
@SQ SN:chr14 LN:106368585
@SQ SN:chr15 LN:100338915
@SQ SN:chr16 LN:88827254
@SQ SN:chr17 LN:78774742
@SQ SN:chr18 LN:76117153
@SQ SN:chr19 LN:63811651
@SQ SN:chr20 LN:62435964
@SQ SN:chrY LN:57772954
@SQ SN:chr22 LN:49691432
@SQ SN:chr21 LN:46944323
@SQ SN:chrM LN:16571
the first 5 data lines from the same BAM file
samtools view NA18507_GAIIx_100_chr21.bam | head -5
EAS51_0210:3:6:3797:7459 89 chr21 9719702 73 100M * 0 0 ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34 SM:i:73 AS:i:0
EAS51_0210:7:33:5109:13959 145 chr21 9719707 254 100M chrY 10653706 0 TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2 SM:i:461 AS:i:0
EAS25_0078:8:23:14907:11377 165 chr21 9719708 255 * * 0 0 CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF H0:i:1 H1:i:6 H2:i:60 SM:i:-1 AS:i:0
EAS25_0078:8:23:14907:11377 89 chr21 9719708 254 100M * 0 0 TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3 SM:i:461 AS:i:0
the meaning of the SAM columns is explained below
1 QNAME String [!-?A-~]{1,255} Query template NAME
2 FLAG Int [0,216 -1] bitwise FLAG
3 RNAME String \*|[!-()+-<>-~][!-~]* Reference sequence NAME
4 POS Int [0,231 -1] 1-based leftmost mapping POSition
5 MAPQ Int [0,28 -1] MAPping Quality
6 CIGAR String \*|([0-9]+[MIDNSHPX=])+ CIGAR string
7 RNEXT String \*|=|[!-()+-<>-~][!-~]* Ref. name of the mate/next read
8 PNEXT Int [0,231 -1] Position of the mate/next read
9 TLEN Int [-231 +1,231 -1] observed Template LENgth
10 SEQ String \*|[A-Za-z=.]+ segment SEQuence
11 QUAL String [!-~]+ ASCII of Phred-scaled base QUALity+33
'QNAME' is the name taken from the original fastQ record while fields 'SEQ' and 'QUAL' reproduce the sequence content of the original fastQ
- look at the 5' end of the first bam sequence and compare it to the two fastQ records above!
- look at the 3' end of the second bam sequence and compare it to the two fastQ records above!
what did you find?
Try it by yourself before expanding on the right!
- the first bam record is identical to the second fastq record (AACCTTTGTTT...)
- the second bam record is identical to the first fastQ record after reverse-complementing it (...GGTTGGATTC-3' => (5'-GAATCCAACC...). This is because of the forward-reverse library structure.
For more information about read orientation, refer to the following IGV help page: http://www.broadinstitute.org/software/igv/interpreting_pair_orientations[3]
the special FLAG field#2
This field contains aggregated binary information that can be deciphered online at http://picard.sourceforge.net/explain-flags.html[4]
BED format
BED format is often used to store coordinate based information about genes and genomic features. More informaton on this format and its relatives can be found on this wiki '.bed' or by asking Dr G.
BED files can be compressed with bgzip for rapid retrieval.
VCF format
[ The Variant Call Format (VCF) specified the format of a text file used in bioinformatics for storing gene sequence variations. The format has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data such as General feature format (GFF) stored all of the genetic data, much of which is redundant because it will be shared across the genomes. By using the variant call format only the variations need to be stored along with a reference genome.] (read more on the wikipedia page (http://en.wikipedia.org/wiki/Variant_Call_Format[5]).
The main tools able to manipulate VCF files in vcftools; see home page with a lot of information http://vcftools.sourceforge.net [6]
Specification information can be found in the latest content version (needs building): https://github.com/samtools/hts-specs [2]
Below are two figures from a VCFtools poster (http://vcftools.sourceforge.net/VCF-poster.pdf[7]) that show typical VCF calls in two genome samples and explain them.
Compressing NGS data, creating indexes, and extracting subregions
Although FASTA, SAM, VCF, and BED can be read using a text processor, it becomes rapidely challenging to find or extract a sequence or row out of the multitude of lines contained in these NGS files. The solution to the problem is to compress the files (often after sorting them in coordinate order) and create an 'index' that can be used for fast and random access retrieval of data.
More examples in our exercise 9
FASTA data
extract FASTA subsequences with samtools faidx
the samtools faidx command parameters
Usage: samtools faidx <file.fa|file.fa.gz> [<reg> [...]]
compress FASTA data with razip and bgzip
NGS data needs be compressed in order to reduce file size and access data without having to proceed through the whole file (tanks to the index). Several compression programs exists that have been used with NGS data. The older one is razip which tends to be now replaced by the improved bgzip, but not for all programs.
comment by Heng Li[8]
the razip command parameters
Usage: razip [options] [file] ... Options: -c write on standard output, keep original files unchanged -d decompress -l list compressed file contents -b INT decompress at INT position in the uncompressed file -s INT decompress INT bytes in the uncompressed file -h give this help
infile=ref/HiSeq_UCSC_hg19.fa razip -c ${infile} > ${infilez}.razip # after compression, the index can be created with 'samtools faidx' samtools faidx ${infilez}.razip # the index is a table of positions and lengths for each chromosome as seen here cat ${infilez}.razip.fai chrM 16571 6 70 71 chr1 249250621 16820 70 71 chr2 243199373 252828171 70 71 chr3 198022430 499501827 70 71 chr4 191154276 700353155 70 71 chr5 180915260 894238213 70 71 chr6 171115067 1077737983 70 71 chr7 159138663 1251297557 70 71 chr8 146364022 1412709636 70 71 chr9 141213431 1561164579 70 71 chr10 135534747 1704395352 70 71 chr11 135006516 1841866317 70 71 chr12 133851895 1978801505 70 71 chr13 115169878 2114565577 70 71 chr14 107349540 2231380746 70 71 chr15 102531392 2340263858 70 71 chr16 90354753 2444259992 70 71 chr17 81195210 2535905535 70 71 chr18 78077248 2618260684 70 71 chr19 59128983 2697453329 70 71 chr20 63025520 2757427019 70 71 chr21 48129895 2821352911 70 71 chr22 51304566 2870170383 70 71 chrX 155270560 2922207878 70 71 chrY 59373566 3079696595 70 71
extracting FASTA data from a local file
# point samtools to a local FASTA file indexed with samtols faidx # the file can be standard text FASTA infile=ref/HiSeq_UCSC_hg19.fa # or can have been compressed with '''razip''' or '''bgzip''' ## compression with 'zip' of 'gzip' are not supported infilez=ref/HiSeq_UCSC_hg19.fa.razip ## if it does not exist on your path, you can ## create the archived form with: ## bgzip -c ${infile} > ${infilez} time samtools faidx \ ${infile}\ chr1:1,000,000-1,001,000 \ chr2:2,000,000-2,001,000 time samtools faidx \ ${infilez}\ chr1:1,000,000-1,001,000 \ chr2:2,000,000-2,001,000 # result in both cases is the same >1:1,000,000-1,001,000 TGGGCACAGCCTCACCCAGGAAAGCAGCTGGGGGTCCACTGGGCTCAGGGAAGACCCCCT GCCAGGGAGACCCCAGGCGCCTGAATGGCCACGGGAAGGAAAACCTACCAGCCCCTCCGT GTGTCCTCCTGGCACATGGCGACCTCCATGACCCGACGAGGGTGCGGGGCCCGGGGCAGG GTGGCCAGGTGCGGGGGTGCGGGGCCCGGGGCAGCTGCCCTCGGTGGGAGGGGTGTGGTG TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG GGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG GGGTCTGCGGGGCCCTGGGGGGGGTGGGGTCTGCGGGGCCCTGGGGGTGTTGTGGTGGGG TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGTGCCCTCGGGGGGTGTGGTGGGG TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGGGGGGCCCTAAGCTTAGATGCAGGTC TCTCCCTGGCAGCCCCTCAAGGCCACGAGGATCAGTGCTCGGAGCCTGGAGGGCTGTGTG CAGGAGTAGCAGGGCCACTGATGCCAGCGGGAAGGCCAGGCAGGGCTTCTGGGTGGAGTT CAAGGTGCATCCTGACCGCTGTCACCTTCAGACTCTGTCCCCTGGGGCTGGGGCAAGTGC CCGATGGGAGCGCAGGGTCTGGGACTGTAGGGTCCAGCCCTACGGAGCTTAGCAGGTGTT CTCCCCGTGTGTGGAGATGAGAGATTGTAATAAATAAAGAC >2:2,000,000-2,001,000 AACCTAACCGTCTACAGTGGGGGCCGCCCTGCCTCAGTGTGGACTGTACAGAACCGAGTG TAGACAGGCCACCTTTACCTACTGAGTGAGGGCCGTCTTGCCTCAGTGTGGGCTGTACAG AACCGAGTGTAGACGGGCCACCTTTACCTAGTGAGTGAGGGCCGTCTTGCCTCAGTGTGG ACTGCACAGAACCGAGTATAGATGGGCCGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCC TCAGTGTGGGCTGCACAGAGCCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGC CGCCCTGCCTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTG AGTGAGGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTT TACCTAGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGATGGGCC GCCTTTACCTAGTGAGTGAGGGCCACCCTACCTCAGTGTGGGCTGCACAGAACCGAGTGT AGACGGGCCGCCTTTACCTAGTGAGCGAGGGCCGCCCTGCCTCAGTGTGGGCTGCACAGA ACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGCCGCCCTGCTTCAGTGTGGA CTGCATGGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCCT CAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGGGCCGCCC TGCTTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGA GGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTTTACCT AGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTT TACCTTACAGGAAGTCTTCTCTCCTAGGAACGGAGGCAGCG # adding time before the command returned: ## local uncompressed version real 0m0.002s user 0m0.001s sys 0m0.001s ## local razip-compressed version real 0m0.002s user 0m0.000s sys 0m0.002s ## the extraction time is identical ## but the file was reduced from 3Gb to 845Mb ## note that not all tools read razip-archived data ## to use compressed FASTA with such tools ## decompress the data to a stream and pipe it to the command fa_archive=some.fasta.razip gzip -dc ${fa_archive} | some-command
extracting FASTA data from a WEB file
# point samtools to a razip-compressed and indexed fasta file on the internet and ask for 2 subsequences provided as coordinates samtools faidx \ ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz \ 1:1,000,000-1,001,000 \ 2:2,000,000-2,001,000 ## the command will first download the index from the web-server to the local folder ## 'hs37d5.fa.gz.fai' (2.8K) ## then use it to retrieve subsequence(s) ## you therefore nee write access to the palce where you run the command # result >1:1,000,000-1,001,000 TGGGCACAGCCTCACCCAGGAAAGCAGCTGGGGGTCCACTGGGCTCAGGGAAGACCCCCT GCCAGGGAGACCCCAGGCGCCTGAATGGCCACGGGAAGGAAAACCTACCAGCCCCTCCGT GTGTCCTCCTGGCACATGGCGACCTCCATGACCCGACGAGGGTGCGGGGCCCGGGGCAGG GTGGCCAGGTGCGGGGGTGCGGGGCCCGGGGCAGCTGCCCTCGGTGGGAGGGGTGTGGTG TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG GGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG GGGTCTGCGGGGCCCTGGGGGGGGTGGGGTCTGCGGGGCCCTGGGGGTGTTGTGGTGGGG TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGTGCCCTCGGGGGGTGTGGTGGGG TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGGGGGGCCCTAAGCTTAGATGCAGGTC TCTCCCTGGCAGCCCCTCAAGGCCACGAGGATCAGTGCTCGGAGCCTGGAGGGCTGTGTG CAGGAGTAGCAGGGCCACTGATGCCAGCGGGAAGGCCAGGCAGGGCTTCTGGGTGGAGTT CAAGGTGCATCCTGACCGCTGTCACCTTCAGACTCTGTCCCCTGGGGCTGGGGCAAGTGC CCGATGGGAGCGCAGGGTCTGGGACTGTAGGGTCCAGCCCTACGGAGCTTAGCAGGTGTT CTCCCCGTGTGTGGAGATGAGAGATTGTAATAAATAAAGAC >2:2,000,000-2,001,000 AACCTAACCGTCTACAGTGGGGGCCGCCCTGCCTCAGTGTGGACTGTACAGAACCGAGTG TAGACAGGCCACCTTTACCTACTGAGTGAGGGCCGTCTTGCCTCAGTGTGGGCTGTACAG AACCGAGTGTAGACGGGCCACCTTTACCTAGTGAGTGAGGGCCGTCTTGCCTCAGTGTGG ACTGCACAGAACCGAGTATAGATGGGCCGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCC TCAGTGTGGGCTGCACAGAGCCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGC CGCCCTGCCTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTG AGTGAGGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTT TACCTAGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGATGGGCC GCCTTTACCTAGTGAGTGAGGGCCACCCTACCTCAGTGTGGGCTGCACAGAACCGAGTGT AGACGGGCCGCCTTTACCTAGTGAGCGAGGGCCGCCCTGCCTCAGTGTGGGCTGCACAGA ACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGCCGCCCTGCTTCAGTGTGGA CTGCATGGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCCT CAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGGGCCGCCC TGCTTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGA GGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTTTACCT AGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTT TACCTTACAGGAAGTCTTCTCTCCTAGGAACGGAGGCAGCG ## @adding time before the command reports: real 0m9.415s user 0m0.005s sys 0m0.007s ## this seems long as compared to the local run above !! ## but consider that you did not need to download the full data first ## which could take up to 30min with a slow internet.
SAM | BAM
Compress SAM <to> BAM
Samtools used below is not limited to the view command. Please refer to the samtools page for more information[9]
the samtools view command parameters
# SAM to BAM samtools view -bS aln.sam > aln.bam # BAM to SAM without header samtools view sample.bam > sample.sam # extract a region (but no header) samtools view sample_sorted.bam "chr1:10-13" # header also added to the output samtools view -h sample_sorted.bam "chr1:10-13" # header added and result stored in bam samtools view -h -b sample_sorted.bam "chr1:10-13" > tiny_sorted.bam
Tabular coordinate-based data
compress tabulat formats with bgzip
Tabular formats are frequents in NGS. They include GTF, GFF, BED, VCF, and more. Such files become often large and it is very convenient to archive them and allow rapid access through indexing.
the bgzip command parameters
Usage: bgzip [options] [file] ... Options: -c write on standard output, keep original files unchanged -d decompress -f overwrite files without asking -I FILE name of the index file [file.gz.gzi] -i compress and create BGZF index -r (re)index bgzipped file -b INT decompress at virtual (uncompressed) file pointer INT (0 based) -s INT decompress INT bytes in the uncompressed file -h give this help
Bgzip is commonly used like gzip
bgzip -c plain-text.file > archive.(b)gz
extract 'gff, bed, sam, vcf, psltbl' data with tabix
Tabix allows rapid retrieval of files indexed with it from text or bgzip archive versions.
the tabix command parameters
Program: tabix (TAB-delimited file InderXer) Version: 0.2.5 (r1005) Usage: tabix <in.tab.bgz> [region1 [region2 [...]]] Options: -p STR preset: gff, bed, sam, vcf, psltbl [gff] -s INT sequence name column [1] -b INT start column [4] -e INT end column; can be identical to '-b' [5] -S INT skip first INT lines [0] -c CHAR symbol for comment/meta lines [#] -r FILE replace the header with the content of FILE [null] -B region1 is a BED file (entire file will be read) -0 zero-based coordinate -h print also the header lines -H print only the header lines -l list chromosome names -f force to overwrite the index
extract 'custom tabular' data with tabix
For other tabular data with coordinate fields, tabix can be tuned to create an index for rapid retrieval. This can be very useful when working with large annotation files like those present in the Annovar database. The 'chromosome', 'start', and 'end' columns do not even need to appear in that order and can be designated independently as shown below.
Annovar intergenic data has format: intergenic NONE(dist=NONE),MIR3648(dist=346401) chr21 9479431 9479431 C T het 21 . with chr in column #3 start position in #4 and end position in #5
infolder="Final_Results/annovar-results" infile="common_NA18507.variant_function" outfolder="annovar-results" mkdir -p ${outfolder} # the data should be re-sorted to make sure that it is presented in chr and position order sort -k 3V,3 -k 4n,4 -k 5n,5 ${infolder}/${infile} >\ ${outfolder}/srt_${infile} bgzip -c ${outfolder}/srt_${infile} > ${outfolder}/srt_${infile}.gz ## index the resulting file # there is no header to skip or keep here, but both options exist tabix -f -S 0 -s 3 -b 4 -e 5 ${outfolder}/srt_${infile}.gz ## you can now retrieve data from the compressed & indexed file using tabix tabix ${outfolder}/srt_${infile}.gz chr21:10703925-10708925 ## resulting in intergenic TEKT4P2(dist=735331),TPTE(dist=202262) chr21 10703925 10703925 T C hom 222 . intergenic TEKT4P2(dist=735380),TPTE(dist=202213) chr21 10703974 10703974 C A het 225 . intergenic TEKT4P2(dist=739135),TPTE(dist=198458) chr21 10707729 10707729 T G het 163 . intergenic TEKT4P2(dist=739190),TPTE(dist=198403) chr21 10707784 10707784 C T het 191 . intergenic TEKT4P2(dist=739351),TPTE(dist=198242) chr21 10707945 10707945 C T het 18.1 . intergenic TEKT4P2(dist=739352),TPTE(dist=198241) chr21 10707946 10707946 A G het 225 . intergenic TEKT4P2(dist=739577),TPTE(dist=198016) chr21 10708171 10708171 A C hom 222 .
extract VCF data from the WEB (1000genome)
# as 1000g is based on B37 io hg19, we need to remove 'chr' in the address and 'hope' that coordinates are identical between builds hg19 and B37 b37_region="21:33031935-33041243" outfolder="data_extraction" mkdir -p ${outfolder} cd ${outfolder} && \ tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ${b37_region} |\ bgzip -c > SOD_1000g.vcf.gz # and index the result tabix -p vcf SOD_1000g.vcf.gz # count obtained rows zgrep -c "^21" SOD_1000g.vcf.gz # -> 71 # inspect first columns of top-9 rows zgrep -v "^##" SOD_1000g.vcf.gz | cut -f 1-6 | head # return to the 'basecamp' cd $BASE
extract from the 1000g download
#CHROM POS ID REF ALT QUAL 21 33031974 rs7277748 A G . 21 33031996 . G C . 21 33032035 . T A . 21 33032287 rs17881180 C T . 21 33032857 rs114905802 C T . 21 33032895 . C G . 21 33032896 rs17884260 T G . 21 33032988 rs6650814 C A . 21 33033001 rs4816405 C G .
References:
- ↑ http://samtools.sourceforge.net
- ↑ 2.0 2.1 https://github.com/samtools/hts-specs
- ↑ http://www.broadinstitute.org/software/igv/interpreting_pair_orientations
- ↑ http://picard.sourceforge.net/explain-flags.html
- ↑ http://en.wikipedia.org/wiki/Variant_Call_Format
- ↑ http://vcftools.sourceforge.net
- ↑ http://vcftools.sourceforge.net/VCF-poster.pdf
- ↑ http://sourceforge.net/p/samtools/mailman/message/27980791/
- ↑ http://samtools.sourceforge.net/samtools.shtml
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS Exercise.1 ]