NGS Exercise.9
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.8 ]
# updated 2014 version
Extract SOD1 locus region data from different NGS data files
When dealing with NGS data, file size can rapidly become an issue when users need to look at a specific position to review results. Fortunately, there exist tools and indexing methods that render file extraction very fast and relatively easy.
Some of the commands unveiled here also work over Internet with file URLs as input and can download pieces from large hosted files to mine a small locus
Contents
Manipulate FASTA data
Sort, compress, and index FASTA
Samtools faidx is used to index FASTA or FASTA.gz data and rapidly retrieve sequences by coordinate.
infolder="ref" infile="HiSeq_UCSC_hg19.fa" # index the reference genome file (if not done already) samtools faidx ${infolder}/${infile}
retrieve a FASTA subsequence by its coordinates
# SOD1 gene # chr21:33031935-33041243 # id = NM_000454 # first 100bps on the 5' side region="chr21:33031935-33032035" samtools faidx ${infolder}/${infile} ${region} >chr21:33031935-33032035 GTTTGGGGCCAGAGTGGGCGAGGCGCGGAGGTCTGGCCTATAAAGTAGTCGCGGAGACGG GGTGCTGGTTTGCGTCGTAGTCTCCTGCAGCGTCTGGGGTT
Batch retrieve sequences from a FASTA and a BED file
It is often very handy to be able to retrieve a large number of sequences based on a list of loci (variations, or any other features generated with BEDTools. Bedtools offers a very nice command for that.
get sequence for multiple intervals
Tool: bedtools getfasta (aka fastaFromBed) Version: v2.17.0-125-g8dcd62a Summary: Extract DNA sequences into a fasta file based on feature coordinates. Usage: bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta> Options: -fi Input FASTA file -bed BED/GFF/VCF file of ranges to extract from -fi -fo Output file (can be FASTA or TAB-delimited) -name Use the name field for the FASTA header -split given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons) -tab Write output in TAB delimited format. - Default is FASTA format. -s Force strandedness. If the feature occupies the antisense, strand, the sequence will be reverse complemented. - By default, strand information is ignored.
infolder="ref" infasta="HiSeq_UCSC_hg19.fa" fullbed="hg19_refGene.bed" inbed="hg19_refGene_chr21.bed" outfolder="data_extraction" mkdir -p ${outfolder} # create chr21 subset of refgene grep "^chr21" ${infolder}/${fullbed} > ${outfolder}/${inbed} # get the full chr21 genes bedtools getfasta -fi ${infolder}/${infasta} \ -bed ${outfolder}/${inbed} \ -fo ${outfolder}/${inbed%%.bed}_genomic.fa \ -name # get the splices transcripts reversed-comlemented for anti-sense bedtools getfasta -fi ${infolder}/${infasta} \ -bed ${outfolder}/${inbed} \ -fo ${outfolder}/${inbed%%.bed}_spliced.fa \ -split \ -s
refgene on chr21
# genomic regions on '+' strand >NR_037421 CGCGACTGCGGCGGCGGTGGTGGGGGGAGCCGCGGGGATCGCCGAGGGCCGGTCGGCCGCCCCGGGTGCCGCGCGGTGCC GCCGGCGGCGGTGAGGCCCCGCGCGTGTGTCCCGGCTGCGGTCGGCCGCGCTCGAGGGGTCCCCGTGGCGTCCCCTTCCC CGCCGGCCGCCTTTCTCGCG >NR_037458 CGCGCGTGCGCCCGAGCGCGGCCCGGTGGTCCCTCCCGGACAGGCGTTCGTGCGACGTGTG >NR_038327 TGCAGAAAGCTAACGCACTGTTTATTTGGGGGATTGGGGGGAAGCACCGTGCCGCTGCTCACTGGTAGCCAGCCAGCTGC AGAATGGTGGGGTAGCAAGTACGATGGGCCATGCACTTCTGGCGGTCGATGAAGAGACTGTTGGTCATGGCAGTGACGTC CTTCTCCAGGCTCATGTGGATGTCCTTGAGGTTGCGCAGGGACTGCTCCGCTTGTAGAAGCTTCTCCCGCAGCGCTGTGA TGGACTTGTACAGCTCCTCCACCTCACTCACCAGCTTGGGGGTGTTGGGGGGTGTGAGCTGGGGCTGGGGAGGGCAGAAG TATGCACCTACTGGGGTGGAGGGGACCCAAAACTCCCAATGGGAGCTGGCAGGAGGTCCTGGGAAGACGCCATGAAAGGA TCCTACCAGGAAAGCGGCTCTAGGGCAGAGCATAAATTACGAGGGTCCTCCCAGGGAGCGCGGGCCTGAGTGGGAATGAG TGACCCGCGTGCAATCTCGACCCTGACAGGACAGGACTGGCCTCAGCCGACCAGGCTCAGTTCTTTCCATTCCTGATATT TGACGGAAGGAGGCACCCAGTTCCTTGAAGGAACTGAGGGGCAGGGAAGGAAGAAAGGTACTTAGGCTCAGAAGGGGCCA CCAGGCATCCGTTAACAAGGGAAACAAAAGGACGGCCCATCTATGCCTGTAGCCCAGGCGTAGGTCATACTTTCCACCAG AGGTGGGGACAGCACCCTCAGGGCAGCAGGGAATGGCCCTCTCTGGCCTCTCTGGCCCCGTGGCCTCCAGGAGCTCACCT TGTCTGAAGGAGCTTCCTCCCGGGAAGATGAGGTAGCACAGGGTGGGGTGAGACCACGTGGGCACAGGTCCCTGGCACGG GGAGGCCCCCGACCCATGACCTGCCCAGTGTCATGTCTATATCTGTGTGTTTAAGGCGTGCTTGCATGTTGTGTGTGGCC GCAGAGTCTCCCTCCCCTCTCAGCACCTGTGGGGGATGTGTGTGTGGAGATTGGAGTGTTTATTGGAGAGGGTCGCAGCG AGGAGGTGGATGGGGGCACCTGGGACTGCCTCAGAGGCCCAGGCAGTACCTGAACTGGGCTGCATCGCGGCACAGCTGCA TGTTGGGCCGGTGTGAGAGCAACTACAGCCGGGTCTGGGCTATGTGCAGAGGTTCCTCCTTGTCCTTGATGGCCTCCTTC ...... # transcripts after splicing >chr21:9825831-9826011(+) CGCGACTGCGGCGGCGGTGGTGGGGGGAGCCGCGGGGATCGCCGAGGGCCGGTCGGCCGCCCCGGGTGCCGCGCGGTGCC GCCGGCGGCGGTGAGGCCCCGCGCGTGTGTCCCGGCTGCGGTCGGCCGCGCTCGAGGGGTCCCCGTGGCGTCCCCTTCCC CGCCGGCCGCCTTTCTCGCG >chr21:9826202-9826263(+) CGCGCGTGCGCCCGAGCGCGGCCCGGTGGTCCCTCCCGGACAGGCGTTCGTGCGACGTGTG >chr21:9907188-9968593(-) ATTTTCCTCCGGAAGTGCGGATCCCAGCGGCGGTCGTGTAGCTGAGCAGGCCTGGGGCTTGGTTCTATGTCCCTGTAGCT ATGTTTCCAGTGTCCTCTGGGTGTTTCCAAGAGCAACAAGAAACGAATAAATCTCTGCCCCGCAGCGCCTCCACCCCAGA GACCCGGACCAAGTTCACACAGGACAATCTGTGCCACGCCCAGCGCGAGCGCCTGGACTCGGCCAACCTGTGGGTGCTGG TGGACTGCATCCTTCGCGACACCTCCGAGGACCTGGGACTCCAGTGTGACGCCGTGAACCTGGCCTTCGGGCGCCGCTGT GAGGAACTGGAGGACGCGCGGCACAAGCTGCAGCACCACCTGCACAAGATGCTGCGGGAAATCACAGATCAGGAACACAA CGTGGTGGCACTGAAGGAGGCCATCAAGGACAAGGAGGAACCTCTGCACATAGCCCAGACCCGGCTGTAGTTGCTCTCAC ACCGGCCCAACATGCAGCTGTGCCGCGATGCAGCCCAGTTCAGGTACTGCCTGGGCCTCTGAGGCAGTCCCAGGTGCCCC CATCCACCTCCTCGCTGCGACCCTCTCCAATAAACACTCCAATCTCCACACACACATCCCCCACAGGTGCTGAGAGGGGA GGGAGACTCTGCGGCCACACACAACATGCAAGCACGCCTTAAACACACAGATATAGACATGACACTGGGCAGGTCATGGG TCGGGGGCCTCCCCGTGCCAGGGACCTGTGCCCACGTGGTCTCACCCCACCCTGTGCTACCTCATCTTCCCGGGAGGAAG CTCCTTCAGACAAGGTGAGCTCCTGGAGGCCACGGGGCCAGAGAGGCCAGAGAGGGCCATTCCCTGCTGCCCTGAGGGTG CTGTCCCCACCTCTGGTGGAAAGTATGACCTACGCCTGGGCTACAGGCATAGATGGGCCGTCCTTTTGTTTCCCTTGTTA ACGGATGCCTGGTGGCCCCTTCTGAGCCTAAGTACCTTTCTTCCTTCCCTGCCCCTCAGTTCCTTCAAGGAACTGGGTGC CTCCTTCCGTCAAATATCAGGAATGGAAAGAACTGAGCCTGGTCGGCTGAGGCCAGTCCTGTCCTGTCAGGGTCGAGATT GCACGCGGGTCACTCATTCCCACTCAGGCCCGCGCTCCCTGGGAGGACCCTCGTAATTTATGCTCTGCCCTAGAGCCGCT TTCCTGGTAGGATCCTTTCATGGCGTCTTCCCAGGACCTCCTGCCAGCTCCCATTGGGAGTTTTGGGTCCCCTCCACCCC AGTAGGTGCATACTTCTGCCCTCCCCAGCCCCAGCTCACACCCCCCAACACCCCCAAGCTGGTGAGTGAGGTGGAGGAGC TGTACAAGTCCATCACAGCGCTGCGGGAGAAGCTTCTACAAGCGGAGCAGTCCCTGCGCAACCTCAAGGACATCCACATG AGCCTGGAGAAGGACGTCACTGCCATGACCAACAGTCTCTTCATCGACCGCCAGAAGTGCATGGCCCATCGTACTTGCTA CCCCACCATTCTGCAGCTGGCTGGCTACCAGTGAGCAGCGGCACGGTGCTTCCCCCCAATCCCCCAAATAAACAGTGCGT TAGCTTTCTGCA # transcripts with '-name' added to the command >NR_037421 CGCGACTGCGGCGGCGGTGGTGGGGGGAGCCGCGGGGATCGCCGAGGGCCGGTCGGCCGCCCCGGGTGCCGCGCGGTGCC GCCGGCGGCGGTGAGGCCCCGCGCGTGTGTCCCGGCTGCGGTCGGCCGCGCTCGAGGGGTCCCCGTGGCGTCCCCTTCCC CGCCGGCCGCCTTTCTCGCG >NR_037458 CGCGCGTGCGCCCGAGCGCGGCCCGGTGGTCCCTCCCGGACAGGCGTTCGTGCGACGTGTG >NR_038327 ATTTTCCTCCGGAAGTGCGGATCCCAGCGGCGGTCGTGTAGCTGAGCAGGCCTGGGGCTTGGTTCTATGTCCCTGTAGCT ATGTTTCCAGTGTCCTCTGGGTGTTTCCAAGAGCAACAAGAAACGAATAAATCTCTGCCCCGCAGCGCCTCCACCCCAGA GACCCGGACCAAGTTCACACAGGACAATCTGTGCCACGCCCAGCGCGAGCGCCTGGACTCGGCCAACCTGTGGGTGCTGG TGGACTGCATCCTTCGCGACACCTCCGAGGACCTGGGACTCCAGTGTGACGCCGTGAACCTGGCCTTCGGGCGCCGCTGT GAGGAACTGGAGGACGCGCGGCACAAGCTGCAGCACCACCTGCACAAGATGCTGCGGGAAATCACAGATCAGGAACACAA CGTGGTGGCACTGAAGGAGGCCATCAAGGACAAGGAGGAACCTCTGCACATAGCCCAGACCCGGCTGTAGTTGCTCTCAC ACCGGCCCAACATGCAGCTGTGCCGCGATGCAGCCCAGTTCAGGTACTGCCTGGGCCTCTGAGGCAGTCCCAGGTGCCCC CATCCACCTCCTCGCTGCGACCCTCTCCAATAAACACTCCAATCTCCACACACACATCCCCCACAGGTGCTGAGAGGGGA GGGAGACTCTGCGGCCACACACAACATGCAAGCACGCCTTAAACACACAGATATAGACATGACACTGGGCAGGTCATGGG TCGGGGGCCTCCCCGTGCCAGGGACCTGTGCCCACGTGGTCTCACCCCACCCTGTGCTACCTCATCTTCCCGGGAGGAAG CTCCTTCAGACAAGGTGAGCTCCTGGAGGCCACGGGGCCAGAGAGGCCAGAGAGGGCCATTCCCTGCTGCCCTGAGGGTG CTGTCCCCACCTCTGGTGGAAAGTATGACCTACGCCTGGGCTACAGGCATAGATGGGCCGTCCTTTTGTTTCCCTTGTTA ACGGATGCCTGGTGGCCCCTTCTGAGCCTAAGTACCTTTCTTCCTTCCCTGCCCCTCAGTTCCTTCAAGGAACTGGGTGC CTCCTTCCGTCAAATATCAGGAATGGAAAGAACTGAGCCTGGTCGGCTGAGGCCAGTCCTGTCCTGTCAGGGTCGAGATT GCACGCGGGTCACTCATTCCCACTCAGGCCCGCGCTCCCTGGGAGGACCCTCGTAATTTATGCTCTGCCCTAGAGCCGCT TTCCTGGTAGGATCCTTTCATGGCGTCTTCCCAGGACCTCCTGCCAGCTCCCATTGGGAGTTTTGGGTCCCCTCCACCCC AGTAGGTGCATACTTCTGCCCTCCCCAGCCCCAGCTCACACCCCCCAACACCCCCAAGCTGGTGAGTGAGGTGGAGGAGC TGTACAAGTCCATCACAGCGCTGCGGGAGAAGCTTCTACAAGCGGAGCAGTCCCTGCGCAACCTCAAGGACATCCACATG AGCCTGGAGAAGGACGTCACTGCCATGACCAACAGTCTCTTCATCGACCGCCAGAAGTGCATGGCCCATCGTACTTGCTA CCCCACCATTCTGCAGCTGGCTGGCTACCAGTGAGCAGCGGCACGGTGCTTCCCCCCAATCCCCCAAATAAACAGTGCGT TAGCTTTCTGCA
Note that although the coordinates come from GB records, the sequence is taken from the reference genome hg19 file. If you al-ign the obtained sequence, you may get mismatches with the original GB transcript.
NR_037421 sequence blast alignment
# genomic regions on '+' strand >NR_037421 CGCGACTGCGGCGGCGGTGGTGGGGGGAGCCGCGGGGATCGCCGAGGGCCGGTCGGCCGCCCCGGGTGCCGCGCGGTGCC GCCGGCGGCGGTGAGGCCCCGCGCGTGTGTCCCGGCTGCGGTCGGCCGCGCTCGAGGGGTCCCCGTGGCGTCCCCTTCCC CGCCGGCCGCCTTTCTCGCG
The sequence was submitted to BLAST for control
NR_038327 spliced and reversed sequence blast alignment
Manipulate SAM and BAM data
Sort, compress, and index SAM data to BAM
SAM format is text-based and we could use a grep or awk command to inspect a full file (sorted by coordinate) and visualise lines comprised in a given interval. Lucklily, this will not be necessary as samtools has a special way to do it. Before extracting data from a SAM file, it needs to be compressed to BAM and indexed; this can be done either using samtools or Picard. This has been already covered in former exercises but is repeated here for clarity.
Sort, Compress, and Index SAM data with samtools
infolder="Final_Results/hg19_bwa-mapping" # full mem data # infile="shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe.sam" # sample mem data infile="shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_pg.sam" outfolder="data_extraction" mkdir -p ${outfolder} # sort the SAM data and compress it # samtools view -Sb ${infolder}/${infile} | # samtools sort - \ # ${outfolder}/samtools_${infile%%.sam} # In this revised version the -u flag replaces the -b flag to save time on compressing and decompressing bam files # this is only possible because we are in a pipe and because sort will create compressed BAM data as final output # -S tells SAM is input, -u tells to keep the bam uncompressed to save cpu time, finally, -h tells to take the header samtools view -Suh ${infolder}/${infile} | samtools sort - \ ${outfolder}/samtools_${infile%%.sam} # indexing is done very simply samtools index ${outfolder}/samtools_${infile%%.sam}.bam
Sort, Compress, and index SAM data with Picard
infolder="Final_Results/hg19_bwa-mapping" # full mem data # infile="shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe.sam" # sample mem data infile="shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_pg.sam" outfolder="data_extraction" mkdir -p ${outfolder} # sort records by coordinate, compress and index in one command java -Xmx4g -jar /opt/biotools/picard/SortSam.jar \ I=${infolder}/${infile} \ O=${outfolder}/picard_${infile%%.sam}.bam \ SO=coordinate \ CREATE_INDEX=true
Extract locus / region data from a SAM/BAM file
Extract locus data with samtools
infolder="Final_Results/hg19_bwa-mapping" # full mem data # infile="shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe.bam" # sample mem data infile="shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe.bam" outfolder="data_extraction" mkdir -p ${outfolder} # the SOD1 gene as example # chr21:33031935-33041243 # id = NM_000454 region="chr21:33031935-33041243" # reading from file and region and output with header to the screen samtools view -h ${infolder}/${infile} ${region} # read region and store results (with header) to a new BAM file samtools view -b ${infolder}/${infile} ${region} > ${outfolder}/region_${infile} # index the file for IGV and other apps samtools index ${outfolder}/region_${infile}
samtools is also able to retrieve reads from internet hosted sorted data provided the bam index is present
Fetch region reads from the low_coverage 1000g data hosted at EBI
url="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/" infolder="/data/NA18507/alignment/" infile="NA18507.mapped.ILLUMINA.bwa.YRI.low_coverage.20120522.bam" # this data is mapped against b37 => no 'chr' b37_region="21:33031935-33041243" outfolder="data_extraction" mkdir -p ${outfolder} samtools view -b ${url}/${infolder}/${infile} ${b37_region} >\ ${outfolder}//region_${infile} samtools index ${outfolder}/region_${infile}
Manipulate BIG - VCF, BED or other tabular data
Creating a compressed version of a file is required to allow random access to its content (vs sequential access for the text form). When a file is sorted and compressed, its content can be quickly addressed using coordinates when an index is present. Compressing and indexing tabular coordinate based files is done using bgzip and tabix. Later retrieval is carried out by tabix from the compressed file and the region string (quite similarly to samtools view for BAM).
compress data with bgzip and index with tabix
bgzip is an adapted version of gzip (Blocked, Bigger & Better GZIP!). It comes with increased performance at the retrieval side and a slightly larger archive size as trade off. If you need more about bgzip speed, please read [1].
bgzip command details
Usage: bgzip [options] [file] ... Options: -c write on standard output, keep original files unchanged -d decompress -f overwrite files without asking -b INT decompress at virtual file pointer INT -s INT decompress INT bytes in the uncompressed file -h give this help
tabix command details
Program: tabix (TAB-delimited file InderXer) Version: 0.2.5 (r1005) Usage: tabix <in.tab.bgz> [region1 [region2 [...]]] Options: -p STR preset: gff, bed, sam, vcf, psltbl [gff] -s INT sequence name column [1] -b INT start column [4] -e INT end column; can be identical to '-b' [5] -S INT skip first INT lines [0] -c CHAR symbol for comment/meta lines [#] -r FILE replace the header with the content of FILE [null] -B region1 is a BED file (entire file will be read) -0 zero-based coordinate -h print also the header lines -H print only the header lines -l list chromosome names -f force to overwrite the index
VCF data example
outfolder="data_extraction" mkdir -p ${outfolder} infolder="Final_Results/vcf-venn/gold-set" infile="chr21_common_NA18507.vcf" # compression has already be done with bgzip but if we had a text VCF file we would use bgzip -c ${infolder}/${infile} > ${outfolder}/${infile}.gz # indexing tabix -f -p vcf ${outfolder}/${infile}.gz
BED data example
We use here the refgene file as well as the chr21_wgEncodeCrgMapabilityAlign100mer.bg file both present in the ref folder
infolder="ref" infile="hg19_refGene.bed" infile2="chr21_wgEncodeCrgMapabilityAlign100mer.bg" bgzip -c ${infolder}/${infile} > ${outfolder}/${infile}.gz bgzip -c ${infolder}/${infile2} > ${outfolder}/${infile2}.gz # indexing tabix -f -p bed ${outfolder}/${infile}.gz tabix -f -p bed ${outfolder}/${infile2}.gz
Annovar data example (as a prototype for other tabular data)
Annovar intergenic data has format:
- intergenic NONE(dist=NONE),MIR3648(dist=346401) chr21 9479431 9479431 C T het 21 .
- with chr in column #3 start position in #4 and end position in #5
infolder="Final_Results/annovar_annotation" infile="common_NA18507.variant_function" outfolder="annovar_annotation" mkdir-p ${outfolder} # the data should be re-sorted to make sure that it is presented in chr and position order sort -k 3V,3 -k 4n,4 -k 5n,5 ${infolder}/${infile} >\ ${outfolder}/srt_${infile} bgzip -c ${outfolder}/srt_${infile} > ${outfolder}/srt_${infile}.gz ## index the resulting file # there is no header to skip or keep here, but both options exist tabix -f -S 0 -s 3 -b 4 -e 5 ${outfolder}/srt_${infile}.gz ## you can now retrieve data from the compressed & indexed file using tabix tabix ${outfolder}/srt_${infile}.gz chr21:10703925-10708925 ## resulting in intergenic TEKT4P2(dist=735331),TPTE(dist=202262) chr21 10703925 10703925 T C hom 222 . intergenic TEKT4P2(dist=735380),TPTE(dist=202213) chr21 10703974 10703974 C A het 225 . intergenic TEKT4P2(dist=739135),TPTE(dist=198458) chr21 10707729 10707729 T G het 163 . intergenic TEKT4P2(dist=739190),TPTE(dist=198403) chr21 10707784 10707784 C T het 191 . intergenic TEKT4P2(dist=739351),TPTE(dist=198242) chr21 10707945 10707945 C T het 18.1 . intergenic TEKT4P2(dist=739352),TPTE(dist=198241) chr21 10707946 10707946 A G het 225 . intergenic TEKT4P2(dist=739577),TPTE(dist=198016) chr21 10708171 10708171 A C hom 222 .
Retrieve data with tabix
Retrieving data is not more difficult than for reads with samtools view and '-h' can be added to keep headers when they are useful and present
# the SOD1 gene region as example # chr21:33031935-33041243 # id = NM_000454 region="chr21:33031935-33041243" # reading from file and region and to the screen (show only top 10 lines) tabix ${infolder}/${infile}.gz ${region} | head # same but output with header (if exists) tabix -h ${infolder}/${infile}.gz ${region}
get all SOD variants from the 1000genome directly from the ether
# as 1000g is based on B367 io hg19, we need to remove 'chr' in the address and consider that coordinates are identical between builds b37_region="21:33031935-33041243" outfolder="data_extraction" tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ${b37_region} |\ bgzip -c > ${outfolder}/SOD_1000g.vcf.gz # and index the result tabix -p vcf ${outfolder}/SOD_1000g.vcf.gz
results for the examples
# from the VCF example chr21 33036814 . GAGAAGAAGA GAGAAGA 217 . SF=0,1,2 GT:PL:GQ 0/1:255,0,255:99 chr21 33037482 . A T 219 . SF=0,1,2 GT:PL:GQ 0/1:249,0,252:99 chr21 33037564 . A G 225 . SF=0,1,2 GT:PL:GQ 0/1:255,0,219:99 chr21 33040162 . C T 225 . SF=0,1,2 GT:PL:GQ 0/1:255,0,255:99 chr21 33040326 . CTT CT 217 . SF=0,1,2 GT:PL:GQ 0/1:255,0,255:99 chr21 33040371 . C T 225 . SF=0,1,2 GT:PL:GQ 0/1:255,0,255:99 # from the first BED example chr21 33031934 33041243 NM_000454 0 + 33032082 33040891 0 5 220,97,70,118,460, 0,4168,6827,7636,8849, # first 10 lines from second BED example chr21 33029853 33033699 1 chr21 33033699 33033700 0.5 chr21 33033700 33033701 0.166667 chr21 33033701 33033702 0.125 chr21 33033702 33033703 0.1 chr21 33033703 33033704 0.25 chr21 33033704 33033707 0.0384615 chr21 33033707 33033716 0.25 chr21 33033716 33033717 0.2 chr21 33033717 33033721 0.166667 ... # from the Annovar example intronic SOD1 chr21 33036821 33036823 AGA - het 217 . intronic SOD1 chr21 33037482 33037482 A T het 219 . intronic SOD1 chr21 33037564 33037564 A G het 225 . intronic SOD1 chr21 33040162 33040162 C T het 225 . intronic SOD1 chr21 33040328 33040328 T - het 217 . intronic SOD1 chr21 33040371 33040371 C T het 225 .
always ensure that the data is sorted by 1:chr 2:start 3:stop before compressing and indexing it. Also control if the data is 0-based and if headers are present
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.8 ]