Understanding and mastering NGS data formats is key to any analysis. You will spend some time learning how to do this and often will need Dr Google to find back how exactly you should access your data. The code below gives a general introduction to what you will encounter.

[ Main_Page | NGS_data_analysis | Hands-on introduction to NGS variant analysis |
| NGS Exercise.1 ]

The NGS format trinity

While fastQ is still the default format to store NGS reads, BAM became the default format to store read and mapping information after alignment of a read to a reference genome. Finally, VCF (variant call format) was recently developped during the 1000 genome project, and became de facto the standard variant manipulation format. (FastQ and VCF will be detailed later today)

CRAM is at the corner of the street! This 'new' / 'improved' BAM format may be adopted for its increased compactness. Read more about CRAM here

FASTQ format

In short, fastQ provides two lines of data and a header line (that starts with '@' and can be repeated or duplicated by a line containing starting by '+'. The header contains a unique identifier (name) and important information about the sequencer origin of the current read. A correct fastQ record is a 4-line block reads as the following. The Base call quality scores are represented by ASCCI characters. If you subtract 33 from the character ascii code, you obtain the exponent of confidence score of the call for a given nucleotide (Q=10^-<ascii-33>).

A pair of reads from the paired end library looks like this (note the /1 and /2 suffix)

@EAS51_0210:3:6:3797:7459/1
GAATCCAACCCTCACAAAGAAGTTTCTCAGAATTCTTCCATCGAATTTTTATGTGATGGTATTTCCTTTTTTACCATAGGCCTCAAAGCGCTCCAAATAT
+
GGGGGGGGGGGGGFGGEGGBGGGGGGFGGGEFGFGGGGFFGGGFDGGGGGGGGEGGFGCEEGFFFFGGGGEGEEBDDEE@GGEGFBEEGEEEEEEB@CDD
@EAS51_0210:3:6:3797:7459/2
AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT
+
GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC

BAM format

The BAM format no more than a binary representation of the SAM format that is more compact and allows fast searching after indexing. BAM reports the address where the reads aligned best, followed by a score illustrating the qualit of mapping, the read sequence (as in the fastQ original file, and a text string of the same length as the sequence reporting the base-call quality for each nucleotide. Finally, the BAM format reports a so called CIGAR string that describes any mismatch between the read sequence and the reference genome at the alignment address. It is not necessary to read BAM fluently but it is strongly recommended to read the SAM format specifications from type to time to discover new features and refresh your memory.

The latest specification document is maintained by Heng Li (father of samtools, bwa, and other widely used tools) and can be found at:

the samtools home page for the latest PDF version: http://samtools.sourceforge.net ^[1]
or for the latest hts-specs documentation (needs building): https://github.com/samtools/hts-specs ^[2]

a typical SAM/BAM header

one @HD header line describing the SAM version @SQ sequence dictionary lines describing each chromosome in the reference genome in a chosen sorting order @PG line to describe software used to generate the read group

Here is the header from today's initial data that contains only three distinct header types which is far from optimal.

samtools view -H NA18507_GAIIx_100_chr21.bam

@HD VN:1.3 SO:coordinate
@PG ID:CASAVA VN:CASAVA-1.7.0 CL:/home/csaunders/devel/CASAVA_20091209/install_main/bin/run.pl -p . --targets bam --bamWholeGenome --bamChangeChromLabels=UCSC -sa --jobsLimit=16
@SQ SN:chr1 LN:247249719
@SQ SN:chr2 LN:242951149
@SQ SN:chr3 LN:199501827
@SQ SN:chr4 LN:191273063
@SQ SN:chr5 LN:180857866
@SQ SN:chr6 LN:170899992
@SQ SN:chr7 LN:158821424
@SQ SN:chrX LN:154913754
@SQ SN:chr8 LN:146274826
@SQ SN:chr9 LN:140273252
@SQ SN:chr10 LN:135374737
@SQ SN:chr11 LN:134452384
@SQ SN:chr12 LN:132349534
@SQ SN:chr13 LN:114142980
@SQ SN:chr14 LN:106368585
@SQ SN:chr15 LN:100338915
@SQ SN:chr16 LN:88827254
@SQ SN:chr17 LN:78774742
@SQ SN:chr18 LN:76117153
@SQ SN:chr19 LN:63811651
@SQ SN:chr20 LN:62435964
@SQ SN:chrY LN:57772954
@SQ SN:chr22 LN:49691432
@SQ SN:chr21 LN:46944323
@SQ SN:chrM LN:16571

the first 5 data lines from the same BAM file

samtools view NA18507_GAIIx_100_chr21.bam | head -5

EAS51_0210:3:6:3797:7459 165 chr21 9719702 255 * * 0 0 AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC H0:i:0 H1:i:0 H2:i:2 SM:i:-1 AS:i:0
EAS51_0210:3:6:3797:7459 89 chr21 9719702 73 100M * 0 0 ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34 SM:i:73 AS:i:0
EAS51_0210:7:33:5109:13959 145 chr21 9719707 254 100M chrY 10653706 0 TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2 SM:i:461 AS:i:0
EAS25_0078:8:23:14907:11377 165 chr21 9719708 255 * * 0 0 CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF H0:i:1 H1:i:6 H2:i:60 SM:i:-1 AS:i:0
EAS25_0078:8:23:14907:11377 89 chr21 9719708 254 100M * 0 0 TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3 SM:i:461 AS:i:0

the meaning of the SAM columns is explained below

Col Field Type Regexp/Range Brief description
1 QNAME String [!-?A-~]{1,255} Query template NAME
2 FLAG Int [0,216 -1] bitwise FLAG
3 RNAME String \*|[!-()+-<>-~][!-~]* Reference sequence NAME
4 POS Int [0,231 -1] 1-based leftmost mapping POSition
5 MAPQ Int [0,28 -1] MAPping Quality
6 CIGAR String \*|([0-9]+[MIDNSHPX=])+ CIGAR string
7 RNEXT String \*|=|[!-()+-<>-~][!-~]* Ref. name of the mate/next read
8 PNEXT Int [0,231 -1] Position of the mate/next read
9 TLEN Int [-231 +1,231 -1] observed Template LENgth
10 SEQ String \*|[A-Za-z=.]+ segment SEQuence
11 QUAL String [!-~]+ ASCII of Phred-scaled base QUALity+33

'QNAME' is the name taken from the original fastQ record while fields 'SEQ' and 'QUAL' reproduce the sequence content of the original fastQ

look at the 5' end of the first bam sequence and compare it to the two fastQ records above!
look at the 3' end of the second bam sequence and compare it to the two fastQ records above!

what did you find?

Try it by yourself before expanding on the right!

the first bam record is identical to the second fastq record (AACCTTTGTTT...)
the second bam record is identical to the first fastQ record after reverse-complementing it (...GGTTGGATTC-3' => (5'-GAATCCAACC...). This is because of the forward-reverse library structure.

For more information about read orientation, refer to the following IGV help page: http://www.broadinstitute.org/software/igv/interpreting_pair_orientations^[3]

the special FLAG field#2

This field contains aggregated binary information that can be deciphered online at http://picard.sourceforge.net/explain-flags.html^[4]

BED format

BED format is often used to store coordinate based information about genes and genomic features. More informaton on this format and its relatives can be found on this wiki '.bed' or by asking Dr G.

BED files can be compressed with bgzip for rapid retrieval.

VCF format

[ The Variant Call Format (VCF) specified the format of a text file used in bioinformatics for storing gene sequence variations. The format has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data such as General feature format (GFF) stored all of the genetic data, much of which is redundant because it will be shared across the genomes. By using the variant call format only the variations need to be stored along with a reference genome.] (read more on the wikipedia page (http://en.wikipedia.org/wiki/Variant_Call_Format^[5]).

The main tools able to manipulate VCF files in vcftools; see home page with a lot of information http://vcftools.sourceforge.net ^[6]

Specification information can be found in the latest content version (needs building): https://github.com/samtools/hts-specs ^[2]

Below are two figures from a VCFtools poster (http://vcftools.sourceforge.net/VCF-poster.pdf^[7]) that show typical VCF calls in two genome samples and explain them.

Compressing NGS data, creating indexes, and extracting subregions

Although FASTA, SAM, VCF, and BED can be read using a text processor, it becomes rapidely challenging to find or extract a sequence or row out of the multitude of lines contained in these NGS files. The solution to the problem is to compress the files (often after sorting them in coordinate order) and create an 'index' that can be used for fast and random access retrieval of data.

More examples in our exercise 9

FASTA data

extract FASTA subsequences with samtools faidx

the samtools faidx command parameters

Usage:   samtools faidx <file.fa|file.fa.gz> [<reg> [...]]

compress FASTA data with razip and bgzip

NGS data needs be compressed in order to reduce file size and access data without having to proceed through the whole file (tanks to the index). Several compression programs exists that have been used with NGS data. The older one is razip which tends to be now replaced by the improved bgzip, but not for all programs.

comment by Heng Li^[8]

Picard does not support random access in a razip'ed file and it is quite complicated to implement the razip method in Java as razip has to access some low-level zlib APIs. Given the substantial amount of work, probably razip will never get implemented in Java, which means Picard will never read razip'ed fasta files. The key difference between razip and bgzip is that razip uses the real file offset while bgzip uses the virtual file offset. Razip is slightly older than bgzip. The very first version of samtools (not public) used razip instead of bgzip. We switched to bgzip/BGZF because it is simpler and much easier to reimplement in Java. It is actually possible to extend BGZF to make it support seeking with the real file offset, but this has not been done.

the razip command parameters

Usage:   razip [options] [file] ...
 
Options: -c      write on standard output, keep original files unchanged
         -d      decompress
         -l      list compressed file contents
         -b INT  decompress at INT position in the uncompressed file
         -s INT  decompress INT bytes in the uncompressed file
         -h      give this help

infile=ref/HiSeq_UCSC_hg19.fa
 
razip -c ${infile} > ${infilez}.razip
 
# after compression, the index can be created with 'samtools faidx'
samtools faidx ${infilez}.razip
 
# the index is a table of positions and lengths for each chromosome as seen here
cat ${infilez}.razip.fai
 
chrM	16571	6	70	71
chr1	249250621	16820	70	71
chr2	243199373	252828171	70	71
chr3	198022430	499501827	70	71
chr4	191154276	700353155	70	71
chr5	180915260	894238213	70	71
chr6	171115067	1077737983	70	71
chr7	159138663	1251297557	70	71
chr8	146364022	1412709636	70	71
chr9	141213431	1561164579	70	71
chr10	135534747	1704395352	70	71
chr11	135006516	1841866317	70	71
chr12	133851895	1978801505	70	71
chr13	115169878	2114565577	70	71
chr14	107349540	2231380746	70	71
chr15	102531392	2340263858	70	71
chr16	90354753	2444259992	70	71
chr17	81195210	2535905535	70	71
chr18	78077248	2618260684	70	71
chr19	59128983	2697453329	70	71
chr20	63025520	2757427019	70	71
chr21	48129895	2821352911	70	71
chr22	51304566	2870170383	70	71
chrX	155270560	2922207878	70	71
chrY	59373566	3079696595	70	71

extracting FASTA data from a local file

# point samtools to a local FASTA file indexed with samtols faidx
 
# the file can be standard text FASTA
infile=ref/HiSeq_UCSC_hg19.fa
 
# or can have been compressed with '''razip''' or '''bgzip'''
## compression with 'zip' of 'gzip' are not supported
infilez=ref/HiSeq_UCSC_hg19.fa.razip
 
## if it does not exist on your path, you can
## create the archived form with:
## bgzip -c ${infile} > ${infilez}
 
time samtools faidx \
    ${infile}\
    chr1:1,000,000-1,001,000 \
    chr2:2,000,000-2,001,000
 
time samtools faidx \
    ${infilez}\
    chr1:1,000,000-1,001,000 \
    chr2:2,000,000-2,001,000
 
# result in both cases is the same
>1:1,000,000-1,001,000
TGGGCACAGCCTCACCCAGGAAAGCAGCTGGGGGTCCACTGGGCTCAGGGAAGACCCCCT
GCCAGGGAGACCCCAGGCGCCTGAATGGCCACGGGAAGGAAAACCTACCAGCCCCTCCGT
GTGTCCTCCTGGCACATGGCGACCTCCATGACCCGACGAGGGTGCGGGGCCCGGGGCAGG
GTGGCCAGGTGCGGGGGTGCGGGGCCCGGGGCAGCTGCCCTCGGTGGGAGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
GGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
GGGTCTGCGGGGCCCTGGGGGGGGTGGGGTCTGCGGGGCCCTGGGGGTGTTGTGGTGGGG
TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGTGCCCTCGGGGGGTGTGGTGGGG
TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGGGGGGCCCTAAGCTTAGATGCAGGTC
TCTCCCTGGCAGCCCCTCAAGGCCACGAGGATCAGTGCTCGGAGCCTGGAGGGCTGTGTG
CAGGAGTAGCAGGGCCACTGATGCCAGCGGGAAGGCCAGGCAGGGCTTCTGGGTGGAGTT
CAAGGTGCATCCTGACCGCTGTCACCTTCAGACTCTGTCCCCTGGGGCTGGGGCAAGTGC
CCGATGGGAGCGCAGGGTCTGGGACTGTAGGGTCCAGCCCTACGGAGCTTAGCAGGTGTT
CTCCCCGTGTGTGGAGATGAGAGATTGTAATAAATAAAGAC
>2:2,000,000-2,001,000
AACCTAACCGTCTACAGTGGGGGCCGCCCTGCCTCAGTGTGGACTGTACAGAACCGAGTG
TAGACAGGCCACCTTTACCTACTGAGTGAGGGCCGTCTTGCCTCAGTGTGGGCTGTACAG
AACCGAGTGTAGACGGGCCACCTTTACCTAGTGAGTGAGGGCCGTCTTGCCTCAGTGTGG
ACTGCACAGAACCGAGTATAGATGGGCCGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCC
TCAGTGTGGGCTGCACAGAGCCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGC
CGCCCTGCCTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTG
AGTGAGGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTT
TACCTAGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGATGGGCC
GCCTTTACCTAGTGAGTGAGGGCCACCCTACCTCAGTGTGGGCTGCACAGAACCGAGTGT
AGACGGGCCGCCTTTACCTAGTGAGCGAGGGCCGCCCTGCCTCAGTGTGGGCTGCACAGA
ACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGCCGCCCTGCTTCAGTGTGGA
CTGCATGGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCCT
CAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGGGCCGCCC
TGCTTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGA
GGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTTTACCT
AGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTT
TACCTTACAGGAAGTCTTCTCTCCTAGGAACGGAGGCAGCG
 
# adding time before the command returned:
## local uncompressed version
real	0m0.002s
user	0m0.001s
sys	0m0.001s
 
## local razip-compressed version
real	0m0.002s
user	0m0.000s
sys	0m0.002s
 
## the extraction time is identical 
## but the file was reduced from 3Gb to 845Mb
 
## note that not all tools read razip-archived data
## to use compressed FASTA with such tools
## decompress the data to a stream and pipe it to the command
 
fa_archive=some.fasta.razip
gzip -dc ${fa_archive} | some-command

extracting FASTA data from a WEB file

# point samtools to a razip-compressed and indexed fasta file on the internet and ask for 2 subsequences provided as coordinates
samtools faidx \
    ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz \
    1:1,000,000-1,001,000 \
    2:2,000,000-2,001,000
 
## the command will first download the index from the web-server to the local folder
## 'hs37d5.fa.gz.fai' (2.8K)
## then use it to retrieve subsequence(s)
 
## you therefore nee write access to the palce where you run the command
 
# result
>1:1,000,000-1,001,000
TGGGCACAGCCTCACCCAGGAAAGCAGCTGGGGGTCCACTGGGCTCAGGGAAGACCCCCT
GCCAGGGAGACCCCAGGCGCCTGAATGGCCACGGGAAGGAAAACCTACCAGCCCCTCCGT
GTGTCCTCCTGGCACATGGCGACCTCCATGACCCGACGAGGGTGCGGGGCCCGGGGCAGG
GTGGCCAGGTGCGGGGGTGCGGGGCCCGGGGCAGCTGCCCTCGGTGGGAGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
GGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
TGGTCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGGGCCCTGGGGGGGTGTGGTG
GGGTCTGCGGGGCCCTGGGGGGGGTGGGGTCTGCGGGGCCCTGGGGGTGTTGTGGTGGGG
TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGCGGTGCCCTCGGGGGGTGTGGTGGGG
TCTGCGGGGCCCTGGGGGGGTGTGGTGGGGTCTGGGGGGCCCTAAGCTTAGATGCAGGTC
TCTCCCTGGCAGCCCCTCAAGGCCACGAGGATCAGTGCTCGGAGCCTGGAGGGCTGTGTG
CAGGAGTAGCAGGGCCACTGATGCCAGCGGGAAGGCCAGGCAGGGCTTCTGGGTGGAGTT
CAAGGTGCATCCTGACCGCTGTCACCTTCAGACTCTGTCCCCTGGGGCTGGGGCAAGTGC
CCGATGGGAGCGCAGGGTCTGGGACTGTAGGGTCCAGCCCTACGGAGCTTAGCAGGTGTT
CTCCCCGTGTGTGGAGATGAGAGATTGTAATAAATAAAGAC
>2:2,000,000-2,001,000
AACCTAACCGTCTACAGTGGGGGCCGCCCTGCCTCAGTGTGGACTGTACAGAACCGAGTG
TAGACAGGCCACCTTTACCTACTGAGTGAGGGCCGTCTTGCCTCAGTGTGGGCTGTACAG
AACCGAGTGTAGACGGGCCACCTTTACCTAGTGAGTGAGGGCCGTCTTGCCTCAGTGTGG
ACTGCACAGAACCGAGTATAGATGGGCCGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCC
TCAGTGTGGGCTGCACAGAGCCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGC
CGCCCTGCCTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTG
AGTGAGGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTT
TACCTAGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGATGGGCC
GCCTTTACCTAGTGAGTGAGGGCCACCCTACCTCAGTGTGGGCTGCACAGAACCGAGTGT
AGACGGGCCGCCTTTACCTAGTGAGCGAGGGCCGCCCTGCCTCAGTGTGGGCTGCACAGA
ACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGTGAGGGCCGCCCTGCTTCAGTGTGGA
CTGCATGGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGAGGGCTGCCCTGCCT
CAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTTTACCTAGTGAGGGCCGCCC
TGCTTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCTGCCTTTACCTAGTGAGTGA
GGGCCGCCCTGCCTCAGTGTGGGCTGTACAGAACCGAGTGTAGACGGGCCGCCTTTACCT
AGTGAGGGCCACCCTGCTTCAGTGTGGGCTGCACAGAACCGAGTGTAGACGGGCCGCCTT
TACCTTACAGGAAGTCTTCTCTCCTAGGAACGGAGGCAGCG
 
## @adding time before the command reports:
real	0m9.415s
user	0m0.005s
sys	0m0.007s
 
## this seems long as compared to the local run above !!
## but consider that you did not need to download the full data first 
## which could take up to 30min with a slow internet.

SAM | BAM

Compress SAM <to> BAM

Samtools used below is not limited to the view command. Please refer to the samtools page for more information^[9]

the samtools view command parameters

Usage: samtools view [options] <in.bam>

# SAM to BAM
samtools view -bS aln.sam > aln.bam 
 
# BAM to SAM without header
samtools view sample.bam > sample.sam
 
# extract a region (but no header)
samtools view sample_sorted.bam "chr1:10-13"
 
# header also added to the output
samtools view -h sample_sorted.bam "chr1:10-13"
 
# header added and result stored in bam
samtools view -h -b sample_sorted.bam "chr1:10-13" > tiny_sorted.bam

Tabular coordinate-based data

compress tabulat formats with bgzip

Tabular formats are frequents in NGS. They include GTF, GFF, BED, VCF, and more. Such files become often large and it is very convenient to archive them and allow rapid access through indexing.

the bgzip command parameters

Usage:   bgzip [options] [file] ...
 
Options: -c        write on standard output, keep original files unchanged
         -d        decompress
         -f        overwrite files without asking
         -I FILE   name of the index file [file.gz.gzi]
         -i        compress and create BGZF index
         -r        (re)index bgzipped file
         -b INT    decompress at virtual (uncompressed) file pointer INT (0 based)
         -s INT    decompress INT bytes in the uncompressed file
         -h        give this help

Bgzip is commonly used like gzip

bgzip -c plain-text.file > archive.(b)gz

extract 'gff, bed, sam, vcf, psltbl' data with tabix

Tabix allows rapid retrieval of files indexed with it from text or bgzip archive versions.

the tabix command parameters

Program: tabix (TAB-delimited file InderXer)
Version: 0.2.5 (r1005)
 
Usage:   tabix <in.tab.bgz> [region1 [region2 [...]]]
 
Options: -p STR     preset: gff, bed, sam, vcf, psltbl [gff]
         -s INT     sequence name column [1]
         -b INT     start column [4]
         -e INT     end column; can be identical to '-b' [5]
         -S INT     skip first INT lines [0]
         -c CHAR    symbol for comment/meta lines [#]
         -r FILE    replace the header with the content of FILE [null]
         -B         region1 is a BED file (entire file will be read)
         -0         zero-based coordinate
         -h         print also the header lines
         -H         print only the header lines
         -l         list chromosome names
         -f         force to overwrite the index

Note the presence of a parameter '-0' for 'zero-based coordinate'

extract 'custom tabular' data with tabix

For other tabular data with coordinate fields, tabix can be tuned to create an index for rapid retrieval. This can be very useful when working with large annotation files like those present in the Annovar database. The 'chromosome', 'start', and 'end' columns do not even need to appear in that order and can be designated independently as shown below.

Annovar intergenic data has format: intergenic NONE(dist=NONE),MIR3648(dist=346401) chr21 9479431 9479431 C T het 21 . with chr in column #3 start position in #4 and end position in #5

infolder="Final_Results/annovar-results"
infile="common_NA18507.variant_function"
 
outfolder="annovar-results"
mkdir -p ${outfolder}
 
# the data should be re-sorted to make sure that it is presented in chr and position order
sort -k 3V,3 -k 4n,4 -k 5n,5 ${infolder}/${infile} >\
    ${outfolder}/srt_${infile}
 
bgzip -c ${outfolder}/srt_${infile} > ${outfolder}/srt_${infile}.gz
 
## index the resulting file
# there is no header to skip or keep here, but both options exist
tabix -f -S 0 -s 3 -b 4 -e 5 ${outfolder}/srt_${infile}.gz
 
## you can now retrieve data from the compressed & indexed file using tabix
tabix ${outfolder}/srt_${infile}.gz chr21:10703925-10708925
 
## resulting in
intergenic	TEKT4P2(dist=735331),TPTE(dist=202262)	chr21	10703925	10703925	T	C	hom	222	.
intergenic	TEKT4P2(dist=735380),TPTE(dist=202213)	chr21	10703974	10703974	C	A	het	225	.
intergenic	TEKT4P2(dist=739135),TPTE(dist=198458)	chr21	10707729	10707729	T	G	het	163	.
intergenic	TEKT4P2(dist=739190),TPTE(dist=198403)	chr21	10707784	10707784	C	T	het	191	.
intergenic	TEKT4P2(dist=739351),TPTE(dist=198242)	chr21	10707945	10707945	C	T	het	18.1	.
intergenic	TEKT4P2(dist=739352),TPTE(dist=198241)	chr21	10707946	10707946	A	G	het	225	.
intergenic	TEKT4P2(dist=739577),TPTE(dist=198016)	chr21	10708171	10708171	A	C	hom	222	.

extract VCF data from the WEB (1000genome)

# as 1000g is based on B37 io hg19, we need to remove 'chr' in the address and 'hope' that coordinates are identical between builds hg19 and B37
b37_region="21:33031935-33041243"
outfolder="data_extraction"
 
mkdir -p ${outfolder}
 
cd ${outfolder} && \
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ${b37_region} |\
    bgzip -c  > SOD_1000g.vcf.gz
 
# and index the result
tabix -p vcf SOD_1000g.vcf.gz
 
# count obtained rows
zgrep -c "^21" SOD_1000g.vcf.gz
# -> 71
 
# inspect first columns of top-9 rows
zgrep -v "^##" SOD_1000g.vcf.gz | cut -f 1-6 | head
 
# return to the 'basecamp'
cd $BASE

extract from the 1000g download

#CHROM	POS	ID	REF	ALT	QUAL
21	33031974	rs7277748	A	G	.
21	33031996	.	G	C	.
21	33032035	.	T	A	.
21	33032287	rs17881180	C	T	.
21	33032857	rs114905802	C	T	.
21	33032895	.	C	G	.
21	33032896	rs17884260	T	G	.
21	33032988	rs6650814	C	A	.
21	33033001	rs4816405	C	G	.

References:

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS Exercise.1 ]

[1] ttp://samtools.sourceforge.net

[hts-specs-2] 2.0 ^2.1 https://github.com/samtools/hts-specs

[3] ttp://www.broadinstitute.org/software/igv/interpreting_pair_orientations

[4] ttp://picard.sourceforge.net/explain-flags.html

[5] ttp://en.wikipedia.org/wiki/Variant_Call_Format

[6] ttp://vcftools.sourceforge.net

[7] ttp://vcftools.sourceforge.net/VCF-poster.pdf

[8] ttp://sourceforge.net/p/samtools/mailman/message/27980791/

[9] ttp://samtools.sourceforge.net/samtools.shtml

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

NGS-formats

Contents

The NGS format trinity

FASTQ format

BAM format

a typical SAM/BAM header

the first 5 data lines from the same BAM file

the meaning of the SAM columns is explained below

the special FLAG field#2

BED format

VCF format

Compressing NGS data, creating indexes, and extracting subregions

FASTA data

extract FASTA subsequences with samtools faidx

compress FASTA data with razip and bgzip

extracting FASTA data from a local file

extracting FASTA data from a WEB file

SAM | BAM

Compress SAM <to> BAM

Tabular coordinate-based data

compress tabulat formats with bgzip

extract 'gff, bed, sam, vcf, psltbl' data with tabix

extract 'custom tabular' data with tabix

extract VCF data from the WEB (1000genome)

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox