NGS Exercise.4b
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.4 | NGS Exercise.5 ]
Advanced usage of BEDTOOLS to retrieve unsequenced regions of chr21 for Sanger re-sequencing (not covered today)
Contents
- 1 introduction
- 2 retrieve the reference sequence of un-sequenced exomic regions of chr21 (from BWA mem)
- 3 keep only refGene exonic fragments
- 4 keep only refGene exonic CODING fragments
- 5 try the flank command and have a look in IGV
- 6 use data.bits URLs directly in IGV (no download required)
- 7 download exercise files
introduction
Bedtools is again helping us here with more powerful commands.
only Bedtools commands with '*' are used today (and only partially!)
The bedtools sub-commands include: [ Genome arithmetic ] * intersect Find overlapping intervals in various ways. window Find overlapping intervals within a window around an interval. closest Find the closest, potentially non-overlapping interval. coverage Compute the coverage over defined intervals. map Apply a function to a column for each overlapping interval. * genomecov Compute the coverage over an entire genome. * merge Combine overlapping/nearby intervals into a single interval. cluster Cluster (but don't merge) overlapping/nearby intervals. complement Extract intervals _not_ represented by an interval file. * subtract Remove intervals based on overlaps b/w two files. * slop Adjust the size of intervals. * flank Create new intervals from the flanks of existing intervals. sort Order the intervals in a file. random Generate random intervals in a genome. shuffle Randomly redistrubute intervals in a genome. annotate Annotate coverage of features from multiple files. [ Multi-way file comparisons ] multiinter Identifies common intervals among multiple interval files. unionbedg Combines coverage intervals from multiple BEDGRAPH files. [ Paired-end manipulation ] pairtobed Find pairs that overlap intervals in various ways. pairtopair Find pairs that overlap other pairs in various ways. [ Format conversion ] bamtobed Convert BAM alignments to BED (& other) formats. bedtobam Convert intervals to BAM records. bamtofastq Convert BAM records to FASTQ records. bedpetobam Convert BEDPE intervals to BAM records. * bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals. [ Fasta manipulation ] * getfasta Use intervals to extract sequences from a FASTA file. maskfasta Use intervals to mask sequences from a FASTA file. nuc Profile the nucleotide content of intervals in a FASTA file. [ BAM focused tools ] multicov Counts coverage from multiple BAMs at specific intervals. tag Tag BAM alignments based on overlaps with interval files. [ Statistical relationships ] jaccard Calculate the Jaccard statistic b/w two sets of intervals. reldist Calculate the distribution of relative distances b/w two files. [ Miscellaneous tools ] overlap Computes the amount of overlap from two intervals. igv Create an IGV snapshot batch script. links Create a HTML page of links to UCSC locations. * makewindows Make interval "windows" across a genome. groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy") <=== THIS IS A GREAT COMMAND!! expand Replicate lines based on lists of values in columns. [ General help ] --help Print this help menu. --version What version of bedtools are you using?. --contact Feature requests, bugs, mailing lists, etc.
In order to identify no-coverage regions overlapping exons, we need a gene model database. We choose here refseq Gene as it is the most common model used by the community. The current version of the refSeq Genes BED file can be obtained from the UCSC tables for hg19 refGene as a BED12 ( “blocked” BED) file reporting each gene model as a separate line with coordinates for each exon.
First three records of the UCSC table download
head -3 ref/hg19_refGene.bed | column -t
chr1 11873 14409 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347, chr1 14361 29370 NR_024540 0 - 29370 29370 0 11 468,69,152,159,198,136,137,147,9 9,154,50, 0,608,1434,2245,2496,2871,3244,3553,3906,10376,14959, chr1 34610 36081 NR_026818 0 - 36081 36081 0 3 564,205,361, 0,666,1110,
convert the UCSC hg19 refGene BED12 transcript track to a BED6 exon track
When using this track as a database, bedtools will not consider exons but instead the full transcript coordinate system and intronic regions will be included. To avoid this, it is necessary to produce a split-file where each separate exon is extracted and annotated by the original transcript GB identifier (we use bedtools bed12tobed6 for this). Here again we have an issue with multiple gene models using shared or overlapping exons which leads to many duplicated lines. We can avoid this by sorting the exon list by coordinate and merge overlapping exons into a single line baring all GB ids as a list of names (we use bedtools merge for that, with the consequence that the largest coordinates will be used). The resulting file was stored under ref/hg19_refGene_bed6.bed
Bedtools bed12tobed6 command
Tool: bedtools bed12tobed6 (aka bed12ToBed6) Version: v2.17.0-111-gab7bbbe Summary: Splits BED12 features into discrete BED6 features. Usage: bedtools bed12tobed6 [OPTIONS] -i <bed12> Options: -n Force the score to be the (1-based) block number from the BED12.
Bedtools merge command (new in bedtools version 2.20)
Tool: bedtools merge (aka mergeBed) Version: v2.20.1 Summary: Merges overlapping BED/GFF/VCF entries into a single interval. Usage: bedtools merge [OPTIONS] -i <bed/gff/vcf> Options: -s Force strandedness. That is, only merge features that are on the same strand. - By default, merging is done without respect to strand. -S Force merge for one specific strand only. Follow with + or - to force merge from only the forward or reverse strand, respectively. - By default, merging is done without respect to strand. -d Maximum distance between features allowed for features to be merged. - Def. 0. That is, overlapping & book-ended features are merged. - (INTEGER) - Note: negative values enforce the number of b.p. required for overlap. -header Print the header from the A file prior to results. -c Specify columns from the input file to operate upon (see -o option, below). Multiple columns can be specified in a comma-delimited list. -o Specify the operation that should be applied to -c. Valid operations: sum, min, max, absmin, absmax, mean, median, collapse (i.e., print a delimited list (duplicates allowed)), distinct (i.e., print a delimited list (NO duplicates allowed)), count count_distinct (i.e., a count of the unique values in the column), Default: sum Multiple operations can be specified in a comma-delimited list. If there is only column, but multiple operations, all operations will be applied on that column. Likewise, if there is only one operation, but multiple columns, that operation will be applied to all columns. Otherwise, the number of columns must match the the number of operations, and will be applied in respective order. E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5, the mean of column 4, and the count of column 6. The order of output columns will match the ordering given in the command. -delim Specify a custom delimiter for the collapse operations. - Example: -delim "|" - Default: ",". Notes: (1) The input file (-i) file must be sorted by chrom, then start.
# all conversion steps as a single piped command! # Use “stdin” when using piped input. infile=ref/hg19_refGene.bed # rem: the old command '-nms -delim "|"' from bedtools version 2.19 # is now replaced by '-c 4 -o distinct -delim "|"' in bedtools version 2.20 # the 4th BED column contains the name and we want each alt name only reported once #bed12ToBed6 -i ${infile} | \ # sort -k 1V,1 -k 2n,2 -k 3n,3 | \ # bedtools merge -nms -delim "|" -i stdin > ${infile%%.bed}_bed6.bed bed12ToBed6 -i ${infile} | \ sort -k 1V,1 -k 2n,2 -k 3n,3 | \ bedtools merge -c 4 -o distinct -delim "|" -i stdin > ${infile%%.bed}_bed6.bed head -5 ${infile%%.bed}_bed6.bed
chr1 11873 12227 NR_046018 chr1 12612 12721 NR_046018 chr1 13220 14829 NR_024540|NR_046018 chr1 14969 15038 NR_024540 chr1 15795 15947 NR_024540
Note that the third line refers to two different transcripts with overlapping exons merged to the largest footprint
download the list of refGene hg19 coding exons (CDS)
Not all exons are coding and users are often interesting by the translated regions only. For that, UCSC allows direct download of a BED6 list of chr21 coding exons (including UTRs parts). the file was stored under ref/hg19_chr21-refGene-CDS.bed. This list will be used to filter CDS-Exons later on and has a slightly different aspect than the exon list above.
head -5 ref/hg19_chr21-refGene-CDS.bed
chr21 10906904 10907040 NM_199259_cds_0_0_chr21_10906905_r 0 - chr21 10906904 10907040 NM_199260_cds_0_0_chr21_10906905_r 0 - chr21 10906904 10907040 NM_199261_cds_0_0_chr21_10906905_r 0 - chr21 10908824 10908895 NM_199259_cds_1_0_chr21_10908825_r 0 - chr21 10908824 10908895 NM_199260_cds_1_0_chr21_10908825_r 0 -
retrieve the reference sequence of un-sequenced exomic regions of chr21 (from BWA mem)
The code required to extract the fasta sequence of all candidate regions is shown next. A similar analysis can be done for the BWA aln data. Additional reference files are required and were added to the ref folder including hg19_gaps.bed listing no-ref regions on hg19, and hg19_refGene.bed reporting all exons in the hg19 refGene model.
In this part we will:
- filter exon fragments with bedtools intersect
- add 200bps either side of the obtained fragments
- merge overlapping fragments
- extract the corresponding fasta sequence from hg19
- also produce a flank-only bed file that could be used to design primers ...
exclude fragments without reference sequence (Nregions=gaps)
- bedtools subtract will remove segments in a file when present in a second file
bedtools subtract command details
Tool: bedtools subtract (aka subtractBed) Version: v2.17.0-111-gab7bbbe Summary: Removes the portion(s) of an interval that is overlapped by another feature(s). Usage: bedtools subtract [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf> Options: -f Minimum overlap required as a fraction of A. - Default is 1E-9 (i.e., 1bp). - (FLOAT) (e.g. 0.50) -s Require same strandedness. That is, only subtract hits in B that overlap A on the _same_ strand. - By default, overlaps are subtracted without respect to strand. -S Force strandedness. That is, only subtract hits in B that overlap A on the _opposite_ strand. - By default, overlaps are subtracted without respect to strand. -A Remove entire feature if any overlap. That is, by default, only subtract the portion of A that overlaps B. Here, if any overlap is found (or -f amount), the entire feature is removed. -N Same as -A except when used with -f, the amount is the sum of all features (not any single feature).
genome=ref/hg19.genome refgene=ref/hg19_refGene.bed # exclude fragments known as gaps gaps=ref/hg19_gaps.bed infolder=Final_Results/bedtools_genomeCvg outfolder="bedtools_genomeCvg" mkdir -p ${outfolder} infile=chr21_nocvg-PE_NA18507_GAIIx_100_chr21_mem.bed outfile=chr21_nogap-nocvg.bed bedtools subtract -a ${infolder}/${infile} -b ${gaps} > ${outfolder}/${outfile} len=$( wc -l ${infolder}/${outfile} | cut -d " " -f 1 ); wid=$( awk 'BEGIN{FS="\t";OFS="\t";sum=0}{sum+=$3-$2}END{print sum}' ${infolder}/${outfile} ); echo -e "count\twidth\n${len}\t${wid}" | column -t
count width 1693 1073302 # preview top 5 lines of input data head -5 ${infolder}/${outfile}
chr21 9411193 9436193 no-coverage 1 chr21 9436293 9438035 no-coverage 1 chr21 9438135 9438205 no-coverage 1 chr21 9438305 9439834 no-coverage 1 chr21 9439934 9449018 no-coverage 1
keep only refGene exonic fragments
- bedtools intersect will take the file reporting no-coverage regions not in gaps and extract from it all fragments overlapping RefGene exons taken from another BED file; This allows focusing on the exomic part of the un-sequenced genome
bedtools intersect command details
Tool: bedtools intersect (aka intersectBed) Version: v2.17.0-111-gab7bbbe Summary: Report overlaps between two feature files. Usage: bedtools intersect [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf> Options: -abam The A input file is in BAM format. Output will be BAM as well. Replaces -a. -ubam Write uncompressed BAM output. Default writes compressed BAM. -bed When using BAM input (-abam), write output as BED. The default is to write output in BAM when using -abam. -wa Write the original entry in A for each overlap. -wb Follow the A entry with the original entry in B for each overlap. - Useful for knowing _what_ A overlaps. Restricted by -f and -r. -loj Perform a "left outer join". That is, for each feature in A report each overlap with B. If no overlaps are found, report a NULL feature for B. -wo Write the original A and B entries plus the number of base pairs of overlap between the two features. - Overlaps restricted by -f and -r. Only A features with overlap are reported. -wao Write the original A and B entries plus the number of base pairs of overlap between the two features. - Overlapping features restricted by -f and -r. However, A features w/o overlap are also reported with a NULL B feature and overlap = 0. -u Write the original A entry _once_ if _any_ overlaps found in B. - In other words, just report the fact >=1 hit was found. - Overlaps restricted by -f and -r. -c For each entry in A, report the number of overlaps with B. - Reports 0 for A entries that have no overlap with B. - Overlaps restricted by -f and -r. -v Only report those entries in A that have _no overlaps_ with B. - Similar to "grep -v" (an homage). -f Minimum overlap required as a fraction of A. - Default is 1E-9 (i.e., 1bp). - FLOAT (e.g. 0.50) -r Require that the fraction overlap be reciprocal for A and B. - In other words, if -f is 0.90 and -r is used, this requires that B overlap 90% of A and A _also_ overlaps 90% of B. -s Require same strandedness. That is, only report hits in B that overlap A on the _same_ strand. - By default, overlaps are reported without respect to strand. -S Require different strandedness. That is, only report hits in B that overlap A on the _opposite_ strand. - By default, overlaps are reported without respect to strand. -split Treat "split" BAM or BED12 entries as distinct BED intervals. -sorted Use the "chromsweep" algorithm for sorted (-k1,1 -k2,2n) input -header Print the header from the A file prior to results. Notes: (1) When a BAM file is used for the A file, the alignment is retained if overlaps exist, and exlcuded if an overlap cannot be found. If multiple overlaps exist, they are not reported, as we are only testing for one or more overlaps.
# keep only chr21_nogap-nocvg overlapping refGene exons refgene=ref/hg19_refGene_bed6.bed infolder=Final_Results/bedtools_genomeCvg outfolder=bedtools_genomeCvg mkdir -p ${outfolder} infile=chr21_nogap-nocvg.bed outfile=chr21_nogap-nocvg_exonic.bed bedtools intersect -wa -a ${infolder}/${infile} -b ${refgene} | \ sort -k 1V,1 -k 2n,2 -k 3n,3 | uniq > \ ${outfolder}/${outfile} head -5 ${infolder}/${outfile} | column -t
chr21 9825829 9825862 no-coverage 1 chr21 9825896 9826107 no-coverage 1 chr21 9826200 9826976 no-coverage 1 chr21 9900995 9912138 no-coverage 1 chr21 9912213 9919835 no-coverage 1
add 200 bps at either ends
- bedtools slop will take the file obtained above and add 200bps to both ends to design PCR primers outside of the missing regions
bedtools slop command details
Tool: bedtools slop (aka slopBed) Version: v2.17.0-111-gab7bbbe Summary: Add requested base pairs of "slop" to each feature. Usage: bedtools slop [OPTIONS] -i <bed/gff/vcf> -g <genome> [-b <int> or (-l and -r)] Options: -b Increase the BED/GFF/VCF entry -b base pairs in each direction. - (Integer) or (Float, e.g. 0.1) if used with -pct. -l The number of base pairs to subtract from the start coordinate. - (Integer) or (Float, e.g. 0.1) if used with -pct. -r The number of base pairs to add to the end coordinate. - (Integer) or (Float, e.g. 0.1) if used with -pct. -s Define -l and -r based on strand. E.g. if used, -l 500 for a negative-stranded feature, it will add 500 bp downstream. Default = false. -pct Define -l and -r as a fraction of the feature's length. E.g. if used on a 1000bp feature, -l 0.50, will add 500 bp "upstream". Default = false. -header Print the header from the input file prior to results. Notes: (1) Starts will be set to 0 if options would force it below 0. (2) Ends will be set to the chromosome length if requested slop would force it above the max chrom length. (3) The genome file should tab delimited and structured as follows: <chromName><TAB><chromSize> For example, Human (hg19): chr1 249250621 chr2 243199373 ... chr18_gl000207_random 4262 Tips: One can use the UCSC Genome Browser's MySQL database to extract chromosome sizes. For example, H. sapiens: mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \ "select chrom, size from hg19.chromInfo" > hg19.genome
# add 200bps either ends of the original bed file and remove duplicates genome=ref/hg19.genome infolder=Final_Results/bedtools_genomeCvg outfolder=bedtools_genomeCvg mkdir -p ${outfolder} infile=chr21_nogap-nocvg_exonic.bed outfile=chr21_nogap-nocvg_exonic_slop200.bed bedtools slop -b 200 -i ${infolder}/${infile} -g ${genome} > \ ${outfolder}/${outfile} head -5 ${infolder}/${outfile} | column -t
chr21 9825629 9826062 no-coverage 1 chr21 9825696 9826307 no-coverage 1 chr21 9826000 9827176 no-coverage 1 chr21 9900795 9912338 no-coverage 1 chr21 9912013 9920035 no-coverage 1
# count slop'ped intervals without sequence overlapping chr21 exons wc -l ${infolder}/${outfile} 70 chr21_nogap-nocvg_exonic_slop200.bed
Note that these two sequences are overlapping! we should consider merging chr21_nogap-nocvg_exonic_slop200.bed overlapping lines before extracting the fasta sequence
merge overlapping fragments
- bedtools merge will collapse overlapping records into one
bedtools merge command details
Tool: bedtools merge (aka mergeBed) Version: v2.17.0-111-gab7bbbe Summary: Merges overlapping BED/GFF/VCF entries into a single interval. Usage: bedtools merge [OPTIONS] -i <bed/gff/vcf> Options: -s Force strandedness. That is, only merge features that are the same strand. - By default, merging is done without respect to strand. -n Report the number of BED entries that were merged. - Note: "1" is reported if no merging occurred. -d Maximum distance between features allowed for features to be merged. - Def. 0. That is, overlapping & book-ended features are merged. - (INTEGER) -nms Report the names of the merged features separated by commas. Change delim. with -delim. -scores Report the scores of the merged features. Specify one of the following options for reporting scores: sum, min, max, mean, median, mode, antimode, collapse (i.e., print a semicolon-separated list), - (INTEGER) -delim Specify a custom delimiter for the -nms and -scores concat options - Example: -delim "|" - Default: ",". Notes: (1) All output, regardless of input type (e.g., GFF or VCF) will in BED format with zero-based starts (2) The input file (-i) file must be sorted by chrom, then start.
# merge overlapping fragments infolder=Final_Results/bedtools_genomeCvg outfolder=bedtools_genomeCvg mkdir -p ${outfolder} infile=chr21_nogap-nocvg_exonic_slop200.bed outfile=chr21_nogap-nocvg_exonic_slop200_merged.bed bedtools merge -i ${infolder}/${infile} > ${outfolder}/${outfile} # count slop'ped intervals without sequence after merging wc -l ${infolder}/${outfile} 59 bedtools_genomeCvg/chr21_nogap-nocvg_exonic_slop200_merged.bed
extract fasta for the obtained fragments
- bedtools getfasta will store the hg19 DNA sequence of each region defined above into a multifasta file
bedtools getfasta command details
Tool: bedtools getfasta (aka fastaFromBed) Version: v2.17.0-111-gab7bbbe Summary: Extract DNA sequences into a fasta file based on feature coordinates. Usage: bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta> Options: -fi Input FASTA file -bed BED/GFF/VCF file of ranges to extract from -fi -fo Output file (can be FASTA or TAB-delimited) -name Use the name field for the FASTA header -split given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons) -tab Write output in TAB delimited format. - Default is FASTA format. -s Force strandedness. If the feature occupies the antisense, strand, the sequence will be reverse complemented. - By default, strand information is ignored.
We do twice the extraction here.
- first from all regions including overlapping ones (creating partial duplicates).
# extract the hg19 merged sequences (some overlap!) infolder=Final_Results/bedtools_genomeCvg outfolder=bedtools_genomeCvg mkdir -p ${outfolder} infile=chr21_nogap-nocvg_exonic_slop200.bed infasta=ref/HiSeq_UCSC_hg19.fa outfasta=chr21_nogap-nocvg_exonic_slop200.fasta bedtools getfasta -fi ${infasta} -bed ${infolder}/${infile} -fo ${outfolder}/${outfasta} # inspect the first two sequences head -4 ${infolder}/${outfasta} | sed -e "s/.\{60\}/&\n/g"
>chr21:9825629-9826062 TGGCCCCGCCTGGGACCGAACCCGGCACCGCCTCGTGGGGCGCCGCCGCCGGCCACTGAT CGGCCCGGCGTCCGCGTCCCCCGGCGCGCGCCTTGGGGACCGGGTTGGTGGCGCCCCGCG TGGGGCCCGGTGGGCTTCCCGGAGGGTTCCGGGGGTCGGCCTGCGGCGCGTGCGGGGGAG GAGACGGTTCCGGGGGACCGGCCGCGACTGCGGCGGCGGTGGTGGGGGGAGCCGCGGGGA TCGCCGAGGGCCGGTCGGCCGCCCCGGGTGCCGCGCGGTGCCGCCGGCGGCGGTGAGGCC CCGCGCGTGTGTCCCGGCTGCGGTCGGCCGCGCTCGAGGGGTCCCCGTGGCGTCCCCTTC CCCGCCGGCCGCCTTTCTCGCGCCTTCCCCGTCGCCCCGGCCTCGCCCGTGGTCTCTCGT CTTCTCCCGGCCC >chr21:9825696-9826307 GCGTCCGCGTCCCCCGGCGCGCGCCTTGGGGACCGGGTTGGTGGCGCCCCGCGTGGGGCC CGGTGGGCTTCCCGGAGGGTTCCGGGGGTCGGCCTGCGGCGCGTGCGGGGGAGGAGACGG TTCCGGGGGACCGGCCGCGACTGCGGCGGCGGTGGTGGGGGGAGCCGCGGGGATCGCCGA GGGCCGGTCGGCCGCCCCGGGTGCCGCGCGGTGCCGCCGGCGGCGGTGAGGCCCCGCGCG TGTGTCCCGGCTGCGGTCGGCCGCGCTCGAGGGGTCCCCGTGGCGTCCCCTTCCCCGCCG GCCGCCTTTCTCGCGCCTTCCCCGTCGCCCCGGCCTCGCCCGTGGTCTCTCGTCTTCTCC CGGCCCGCTCTTCCGAACCGGGTCGGCGCGTCCCCCGGGTGCGCCTCGCTTCCCGGGCCT GCCGCGGCCCTTCCCCGAGGCGTCCGTCCCGGGCGTCGGCGTCGGGGAGAGCCCGTCCTC CCCGCGTGGCGTCGCCCCGTTCGGCGCGCGCGTGCGCCCGAGCGCGGCCCGGTGGTCCCT CCCGGACAGGCGTTCGTGCGACGTGTGGCGTGGGTCGACCTCCGCCTTGCCGGTCGCTCG CCCTTTCCCCG
- and now after collapsing overlapping regions to the largest overlap.
# extract the hg19 merged sequences (less but much longer) infolder=bedtools_genomeCvg infile=chr21_nogap-nocvg_exonic_slop200.bed infile=chr21_nogap-nocvg_exonic_slop200_merged.bed infasta=ref/HiSeq_UCSC_hg19.fa outfasta=chr21_nogap-nocvg_exonic_slop200_merged.fasta bedtools getfasta -fi ${infasta} -bed ${infolder}/${infile} -fo ${infolder}/${outfasta} # inspect the first two sequences head -4 ${infolder}/${outfasta} | sed -e "s/.\{60\}/&\n/g"
>chr21:9825629-9827176 TGGCCCCGCCTGGGACCGAACCCGGCACCGCCTCGTGGGGCGCCGCCGCCGGCCACTGAT CGGCCCGGCGTCCGCGTCCCCCGGCGCGCGCCTTGGGGACCGGGTTGGTGGCGCCCCGCG TGGGGCCCGGTGGGCTTCCCGGAGGGTTCCGGGGGTCGGCCTGCGGCGCGTGCGGGGGAG ... GGTTGATCCTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTAC GCACGGCCGGTACAGTGAAACTGCGAATGGCTCATTAAATCAGTTATGGTTCCTTTGGTC GCTCGCTCCTCTCCTACTTGGATAACTGTGGTAATTCTAGAGCTAAT >chr21:9900795-9920035 AAAGTTTTCTTAAATATGAGACCAGCAACAAGAATAAAAATTGATGAATTGGACTTCATC AAAATTAAACATTTTTGCACTTCAAAGGACACCATCAAGAAAGTGAAAAGACAACTCACA AGATGGAAGAAAATACTTGTAAATCATGTGTAGCACTTTTTGATGATAAACTTGAGGAAG ... TCCACTCTGACCTGACCTTCAGTAGTTCCTTTATTTTTATTTTCAAGCATTTTTGTGAGT ATATAATAAGCATATATATTTATGGGGTGCATGAGATGTTTTGATACAGGCATGCAGCAT GAACTAATCACATCATGGAGAGTGGGTATCCATCCCTTTA
BED files created during this exercise can be visualized in IGV to spot these regions (we provide them from the bottom link
keep only refGene exonic CODING fragments
- bedtools intersect will take the file reporting no-coverage regions not in gaps and extract from it all fragments overlapping RefGene CDS-exons taken from another BED file. This allows focusing on the CODING & exomic part of the un-sequenced genome
# keep only chr21_nogap-nocvg overlapping refGene CDS-exons refgene=ref/merged_hg19_chr21-refGene-CDS.bed infile=chr21_nogap-nocvg.bed infolder=bedtools_genomeCvg outfile=chr21_nogap-nocvg_CDS-exonic.bed bedtools intersect -wa -a ${infolder}/${infile} -b ${refgene} | \ sort -k 1V,1 -k 2n,2 -k 3n,3 | uniq > \ ${infolder}/${outfile} # count them wc -l ${infolder}/${outfile}
22 bedtools_genomeCvg/chr21_nogap-nocvg_CDS-exonic.bed
# and show some head -5 ${infolder}/${outfile} | column -t
chr21 31913981 31913982 no-coverage 1 chr21 33245961 33246026 no-coverage 1 chr21 34399827 34399850 no-coverage 1 chr21 34399902 34400050 no-coverage 1 chr21 34442935 34442953 no-coverage 1
add 200 bps at either ends
- bedtools slop will take the file obtained above and add 200bps to both ends to design PCR primers outside of the missing regions
# add 200bps either ends of the original bed file and remove duplicates genome=ref/hg19.genome infile=chr21_nogap-nocvg_CDS-exonic.bed infolder=bedtools_genomeCvg outfile=chr21_nogap-nocvg_CDS-exonic_slop200.bed bedtools slop -b 200 -i ${infolder}/${infile} -g ${genome} > \ ${infolder}/${outfile} head -5 ${infolder}/${outfile} | column -t
chr21 31913781 31914182 no-coverage 1 chr21 33245761 33246226 no-coverage 1 chr21 34399627 34400050 no-coverage 1 chr21 34399702 34400250 no-coverage 1 chr21 34442735 34443153 no-coverage 1
How many of the 22 remain after merging overlaps?
bedtools merge -i bedtools_genomeCvg/chr21_nogap-nocvg_CDS-exonic_slop200.bed | wc -l
20
This is reasonable, lets take those for further processing.
extract fasta for the obtained fragments
- bedtools getfasta will store the hg19 DNA sequence of each region defined above into a multifasta file
# extract the hg19 merged sequences (some overlap!) infolder=bedtools_genomeCvg infile=chr21_nogap-nocvg_CDS-exonic_slop200.bed infasta=ref/HiSeq_UCSC_hg19.fa outfasta=chr21_nogap-nocvg_CDS-exonic_slop200.fasta bedtools getfasta -fi ${infasta} -bed ${infolder}/${infile} -fo ${infolder}/${outfasta} # inspect the first two sequences head -4 ${infolder}/${outfasta} | sed -e "s/.\{60\}/&\n/g"
>chr21:31913781-31914182 GACTAAATAATAATAGGTGAAACCCTGATTCAATGTAATGATTCCAGATTCTTCCATAAA AACTGATGGCTAGTTCCTTCTGGAGCATGAAGCCAAAAAATGGTAGGTATTTTCTCTACA TTGAGAAAAGTGTTCGCCTTGCTGAAGCATACGGGCAAGATCTTGGAGTTAGAGTATGAG AATCAGCAAGTTTTTTAGTAGAATCCAGAGAATCCATATCCTTCACGGCATGATGGGCGG CAGCAGCCATATCTATAGCCTCCATAGCCAGAGCCATATCTGTAGCCTCCACAGCCACAG CCATAGCCCAGACCACCAAAGCCTCCACAGCCATATCCCAGGCCTCTGTAGTAGCTGCCA TAGTATCTCATGGTGTCAGGGACTAGAGATTCAGTTCAATG >chr21:33245761-33246226 GCGGGGTCCCCGGTCCCGCGTCCCCTGGGCAGCCGCTATTGTCTACGCGCCTCGCTGGGC GGCGCGGGGGGCGTGATCGCGGCGGCCCCGGGCTCTGGGTGCGGAGACCCAGGCGGGGCT GGGCCCAGGGCGGCGGCGGGAGAAGCCGGGGAAGCCGAAGAGCCTGGGGAGGAGGAGCTG CGAGCGCGGGAGACGAGCAGGAGCCGCGCGGGCCGCGGCGAGCGCGATGCCGGCGGCGGC GGGGGACGGGCTCCTGGGGGAGCCGGCGGCGCCTGGGGGCGGCGGCGGCGCGGAGGACGC GGCCAGGCCCGCGGCGGCCTGCGAGGGAAGTTTCCTGCCTGCCTGGGTGAGCGGCGTGCC CCGCGAGCGGCTCCGCGACTTCCAGCACCACAAGCGCGTGGGCAACTACCTCATCGGCAG CAGGAAGCTGGGCGAGGGCTCCTTTGCCAAGGTGCGCGAGGGGCT
try the flank command and have a look in IGV
- bedtools flank is also worth mentioning here to extract the borders of a given bed list; It can be used for both sides or for one side and for a defined length or the a fraction of the length as the query fragment. (eg: flank can be useful to extract boundaries and design primers outside of a set of regions to be re-sequenced)
bedtools flank command details
Tool: bedtools flank (aka flankBed) Version: v2.17.0-111-gab7bbbe Summary: Creates flanking interval(s) for each BED/GFF/VCF feature. Usage: bedtools flank [OPTIONS] -i <bed/gff/vcf> -g <genome> [-b <int> or (-l and -r)] Options: -b Create flanking interval(s) using -b base pairs in each direction. - (Integer) or (Float, e.g. 0.1) if used with -pct. -l The number of base pairs that a flank should start from orig. start coordinate. - (Integer) or (Float, e.g. 0.1) if used with -pct. -r The number of base pairs that a flank should end from orig. end coordinate. - (Integer) or (Float, e.g. 0.1) if used with -pct. -s Define -l and -r based on strand. E.g. if used, -l 500 for a negative-stranded feature, it will start the flank 500 bp downstream. Default = false. -pct Define -l and -r as a fraction of the feature's length. E.g. if used on a 1000bp feature, -l 0.50, will add 500 bp "upstream". Default = false. -header Print the header from the input file prior to results. Notes: (1) Starts will be set to 0 if options would force it below 0. (2) Ends will be set to the chromosome length if requested flank would force it above the max chrom length. (3) In contrast to slop, which _extends_ intervals, bedtools flank creates new intervals from the regions just up- and down-stream of your existing intervals. (4) The genome file should tab delimited and structured as follows: <chromName><TAB><chromSize> For example, Human (hg19): chr1 249250621 chr2 243199373 ... chr18_gl000207_random 4262 Tips: One can use the UCSC Genome Browser's MySQL database to extract chromosome sizes. For example, H. sapiens: mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \ "select chrom, size from hg19.chromInfo" > hg19.genome
######## bonus bedtools flank command ###################### # each line is one flank (left then right) respective to the initial segment infolder=bedtools_genomeCvg infile=chr21_nogap-nocvg_exonic.bed outfile=chr21_nogap-nocvg_exonic_flank200.bed bedtools flank -i ${infolder}/${infile} -g ${genome} -b 200 | \ sort -k 1V,1 -k 2n,2 -k 3n,3 | uniq > \ ${infolder}/${outfile} head -5 ${infolder}/${outfile} | column -t
chr21 9825629 9825829 no-coverage 1 chr21 9825696 9825896 no-coverage 1 chr21 9825862 9826062 no-coverage 1 chr21 9826000 9826200 no-coverage 1 chr21 9826107 9826307 no-coverage 1
Zooming in (chr21:33,245,874-33,246,120) to show the missing CDS
use data.bits URLs directly in IGV (no download required)
All BED files created during this exercise can be visualized in IGV to spot these regions and can even be loaded DIRECTLY within IGV from the following data.bits URLs
list of URL for IGV
http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_exonic.bed http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_exonic_slop200.bed http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_exonic_slop200_merged.bed http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_CDS-exonic.bed http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_CDS-exonic_slop200.bed http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_exonic_flank200.bed
Please read the doc to discover other bedtools, also become familiar with IGV, the combination is dynamite
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.4 | NGS Exercise.5 ]