Advanced usage of BEDTOOLS to retrieve unsequenced regions of chr21 for Sanger re-sequencing (not covered today)

1 introduction
- 1.1 convert the UCSC hg19 refGene BED12 transcript track to a BED6 exon track
- 1.2 download the list of refGene hg19 coding exons (CDS)
2 retrieve the reference sequence of un-sequenced exomic regions of chr21 (from BWA mem)
- 2.1 exclude fragments without reference sequence (Nregions=gaps)
3 keep only refGene exonic fragments
4 keep only refGene exonic CODING fragments
- 4.1 add 200 bps at either ends
- 4.2 extract fasta for the obtained fragments
5 try the flank command and have a look in IGV
6 use data.bits URLs directly in IGV (no download required)
7 download exercise files

introduction

Bedtools is again helping us here with more powerful commands.

only Bedtools commands with '*' are used today (and only partially!)

The bedtools sub-commands include:
[ Genome arithmetic ]
* intersect Find overlapping intervals in various ways.
window Find overlapping intervals within a window around an interval.
closest Find the closest, potentially non-overlapping interval.
coverage Compute the coverage over defined intervals.
map Apply a function to a column for each overlapping interval.
* genomecov Compute the coverage over an entire genome.
* merge Combine overlapping/nearby intervals into a single interval.
cluster Cluster (but don't merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
* subtract Remove intervals based on overlaps b/w two files.
* slop Adjust the size of intervals.
* flank Create new intervals from the flanks of existing intervals.
sort Order the intervals in a file.
random Generate random intervals in a genome.
shuffle Randomly redistrubute intervals in a genome.
annotate Annotate coverage of features from multiple files.
[ Multi-way file comparisons ]
multiinter Identifies common intervals among multiple interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.
[ Paired-end manipulation ]
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.
[ Format conversion ]
bamtobed Convert BAM alignments to BED (& other) formats.
bedtobam Convert intervals to BAM records.
bamtofastq Convert BAM records to FASTQ records.
bedpetobam Convert BEDPE intervals to BAM records.
* bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
[ Fasta manipulation ]
* getfasta Use intervals to extract sequences from a FASTA file.
maskfasta Use intervals to mask sequences from a FASTA file.
nuc Profile the nucleotide content of intervals in a FASTA file.
[ BAM focused tools ]
multicov Counts coverage from multiple BAMs at specific intervals.
tag Tag BAM alignments based on overlaps with interval files.
[ Statistical relationships ]
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
reldist Calculate the distribution of relative distances b/w two files.
[ Miscellaneous tools ]
overlap Computes the amount of overlap from two intervals.
igv Create an IGV snapshot batch script.
links Create a HTML page of links to UCSC locations.
* makewindows Make interval "windows" across a genome.
groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy") <=== THIS IS A GREAT COMMAND!!
expand Replicate lines based on lists of values in columns.
[ General help ]
--help Print this help menu.
--version What version of bedtools are you using?.
--contact Feature requests, bugs, mailing lists, etc.

In order to identify no-coverage regions overlapping exons, we need a gene model database. We choose here refseq Gene as it is the most common model used by the community. The current version of the refSeq Genes BED file can be obtained from the UCSC tables for hg19 refGene as a BED12 ( “blocked” BED) file reporting each gene model as a separate line with coordinates for each exon.

First three records of the UCSC table download

head -3 ref/hg19_refGene.bed | column -t

chr1	11873	14409	NR_046018	0	+	14409	14409	0	3	354,109,1189,	0,739,1347,
chr1	14361	29370	NR_024540	0	-	29370	29370	0	11	468,69,152,159,198,136,137,147,9
9,154,50,	0,608,1434,2245,2496,2871,3244,3553,3906,10376,14959,
chr1	34610	36081	NR_026818	0	-	36081	36081	0	3	564,205,361,	0,666,1110,

convert the UCSC hg19 refGene BED12 transcript track to a BED6 exon track

When using this track as a database, bedtools will not consider exons but instead the full transcript coordinate system and intronic regions will be included. To avoid this, it is necessary to produce a split-file where each separate exon is extracted and annotated by the original transcript GB identifier (we use bedtools bed12tobed6 for this). Here again we have an issue with multiple gene models using shared or overlapping exons which leads to many duplicated lines. We can avoid this by sorting the exon list by coordinate and merge overlapping exons into a single line baring all GB ids as a list of names (we use bedtools merge for that, with the consequence that the largest coordinates will be used). The resulting file was stored under ref/hg19_refGene_bed6.bed

Bedtools bed12tobed6 command

Tool:    bedtools bed12tobed6 (aka bed12ToBed6)
Version: v2.17.0-111-gab7bbbe
Summary: Splits BED12 features into discrete BED6 features.
 
Usage:   bedtools bed12tobed6 [OPTIONS] -i <bed12>
 
Options: 
	-n	Force the score to be the (1-based) block number from the BED12.

Bedtools merge command (new in bedtools version 2.20)

Tool:    bedtools merge (aka mergeBed)
Version: v2.20.1
Summary: Merges overlapping BED/GFF/VCF entries into a single interval.
 
Usage:   bedtools merge [OPTIONS] -i <bed/gff/vcf>
 
Options: 
	-s	Force strandedness.  That is, only merge features
		that are on the same strand.
		- By default, merging is done without respect to strand.
 
	-S	Force merge for one specific strand only.
		Follow with + or - to force merge from only
		the forward or reverse strand, respectively.
		- By default, merging is done without respect to strand.
 
	-d	Maximum distance between features allowed for features
		to be merged.
		- Def. 0. That is, overlapping & book-ended features are merged.
		- (INTEGER)
		- Note: negative values enforce the number of b.p. required for overlap.
 
	-header	Print the header from the A file prior to results.
 
	-c	Specify columns from the input file to operate upon (see -o option, below).
		Multiple columns can be specified in a comma-delimited list.
 
	-o	Specify the operation that should be applied to -c.
		Valid operations:
		    sum, min, max, absmin, absmax,
		    mean, median,
		    collapse (i.e., print a delimited list (duplicates allowed)), 
		    distinct (i.e., print a delimited list (NO duplicates allowed)), 
		    count
		    count_distinct (i.e., a count of the unique values in the column), 
		Default: sum
		Multiple operations can be specified in a comma-delimited list.
 
		If there is only column, but multiple operations, all operations will be
		applied on that column. Likewise, if there is only one operation, but
		multiple columns, that operation will be applied to all columns.
		Otherwise, the number of columns must match the the number of operations,
		and will be applied in respective order.
		E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5,
		the mean of column 4, and the count of column 6.
		The order of output columns will match the ordering given in the command.
 
 
	-delim	Specify a custom delimiter for the collapse operations.
		- Example: -delim "|"
		- Default: ",".
 
Notes: 
	(1) The input file (-i) file must be sorted by chrom, then start.

# all conversion steps as a single piped command!
# Use “stdin” when using piped input.
infile=ref/hg19_refGene.bed
 
# rem: the old command '-nms -delim "|"' from bedtools version 2.19 
# is now replaced by '-c 4 -o distinct -delim "|"' in bedtools version 2.20
# the 4th BED column contains the name and we want each alt name only reported once
 
#bed12ToBed6 -i ${infile} | \
#   sort -k 1V,1 -k 2n,2 -k 3n,3 | \
#   bedtools merge -nms -delim "|" -i stdin > ${infile%%.bed}_bed6.bed
 
bed12ToBed6 -i ${infile} | \
    sort -k 1V,1 -k 2n,2 -k 3n,3 | \
    bedtools merge -c 4 -o distinct -delim "|" -i stdin > ${infile%%.bed}_bed6.bed
 
head -5 ${infile%%.bed}_bed6.bed

chr1	11873	12227	NR_046018
chr1	12612	12721	NR_046018
chr1	13220	14829	NR_024540|NR_046018
chr1	14969	15038	NR_024540
chr1	15795	15947	NR_024540

Note that the third line refers to two different transcripts with overlapping exons merged to the largest footprint

download the list of refGene hg19 coding exons (CDS)

Not all exons are coding and users are often interesting by the translated regions only. For that, UCSC allows direct download of a BED6 list of chr21 coding exons (including UTRs parts). the file was stored under ref/hg19_chr21-refGene-CDS.bed. This list will be used to filter CDS-Exons later on and has a slightly different aspect than the exon list above.

head -5 ref/hg19_chr21-refGene-CDS.bed

chr21  10906904  10907040  NM_199259_cds_0_0_chr21_10906905_r  0  -
chr21  10906904  10907040  NM_199260_cds_0_0_chr21_10906905_r  0  -
chr21  10906904  10907040  NM_199261_cds_0_0_chr21_10906905_r  0  -
chr21  10908824  10908895  NM_199259_cds_1_0_chr21_10908825_r  0  -
chr21  10908824  10908895  NM_199260_cds_1_0_chr21_10908825_r  0  -

retrieve the reference sequence of un-sequenced exomic regions of chr21 (from BWA mem)

The code required to extract the fasta sequence of all candidate regions is shown next. A similar analysis can be done for the BWA aln data. Additional reference files are required and were added to the ref folder including hg19_gaps.bed listing no-ref regions on hg19, and hg19_refGene.bed reporting all exons in the hg19 refGene model.

In this part we will:

filter exon fragments with bedtools intersect
add 200bps either side of the obtained fragments
merge overlapping fragments
extract the corresponding fasta sequence from hg19
also produce a flank-only bed file that could be used to design primers ...

exclude fragments without reference sequence (Nregions=gaps)

bedtools subtract will remove segments in a file when present in a second file

bedtools subtract command details

Tool:    bedtools subtract (aka subtractBed)
Version: v2.17.0-111-gab7bbbe
Summary: Removes the portion(s) of an interval that is overlapped
	 by another feature(s).
 
Usage:   bedtools subtract [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>
 
Options: 
	-f	Minimum overlap required as a fraction of A.
		- Default is 1E-9 (i.e., 1bp).
		- (FLOAT) (e.g. 0.50)
 
	-s	Require same strandedness.  That is, only subtract hits in B
		that overlap A on the _same_ strand.
		- By default, overlaps are subtracted without respect to strand.
 
	-S	Force strandedness.  That is, only subtract hits in B that
		overlap A on the _opposite_ strand.
		- By default, overlaps are subtracted without respect to strand.
 
	-A	Remove entire feature if any overlap.  That is, by default,
		only subtract the portion of A that overlaps B. Here, if
		any overlap is found (or -f amount), the entire feature is removed.
 
	-N	Same as -A except when used with -f, the amount is the sum
		of all features (not any single feature).

genome=ref/hg19.genome
refgene=ref/hg19_refGene.bed
 
# exclude fragments known as gaps
gaps=ref/hg19_gaps.bed
 
infolder=Final_Results/bedtools_genomeCvg
outfolder="bedtools_genomeCvg"
mkdir -p ${outfolder}
 
infile=chr21_nocvg-PE_NA18507_GAIIx_100_chr21_mem.bed
outfile=chr21_nogap-nocvg.bed
 
bedtools subtract -a ${infolder}/${infile} -b ${gaps} > ${outfolder}/${outfile}
 
len=$( wc -l ${infolder}/${outfile} | cut -d " " -f 1 );
wid=$( awk 'BEGIN{FS="\t";OFS="\t";sum=0}{sum+=$3-$2}END{print sum}' ${infolder}/${outfile} );
 
echo -e "count\twidth\n${len}\t${wid}" | column -t

count width
1693  1073302
 
# preview top 5 lines of input data
head -5 ${infolder}/${outfile}

chr21	9411193	9436193	no-coverage	1
chr21	9436293	9438035	no-coverage	1
chr21	9438135	9438205	no-coverage	1
chr21	9438305	9439834	no-coverage	1
chr21	9439934	9449018	no-coverage	1

keep only refGene exonic fragments

bedtools intersect will take the file reporting no-coverage regions not in gaps and extract from it all fragments overlapping RefGene exons taken from another BED file; This allows focusing on the exomic part of the un-sequenced genome

bedtools intersect command details

Tool:    bedtools intersect (aka intersectBed)
Version: v2.17.0-111-gab7bbbe
Summary: Report overlaps between two feature files.
 
Usage:   bedtools intersect [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>
 
Options: 
	-abam	The A input file is in BAM format.  Output will be BAM as well. Replaces -a.
 
	-ubam	Write uncompressed BAM output. Default writes compressed BAM.
 
	-bed	When using BAM input (-abam), write output as BED. The default
		is to write output in BAM when using -abam.
 
	-wa	Write the original entry in A for each overlap.
 
	-wb	Follow the A entry with the original entry in B for each overlap.
		- Useful for knowing _what_ A overlaps. Restricted by -f and -r.
 
	-loj	Perform a "left outer join". That is, for each feature in A
		report each overlap with B.  If no overlaps are found, 
		report a NULL feature for B.
 
	-wo	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlaps restricted by -f and -r.
		  Only A features with overlap are reported.
 
	-wao	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlapping features restricted by -f and -r.
		  However, A features w/o overlap are also reported
		  with a NULL B feature and overlap = 0.
 
	-u	Write the original A entry _once_ if _any_ overlaps found in B.
		- In other words, just report the fact >=1 hit was found.
		- Overlaps restricted by -f and -r.
 
	-c	For each entry in A, report the number of overlaps with B.
		- Reports 0 for A entries that have no overlap with B.
		- Overlaps restricted by -f and -r.
 
	-v	Only report those entries in A that have _no overlaps_ with B.
		- Similar to "grep -v" (an homage).
 
	-f	Minimum overlap required as a fraction of A.
		- Default is 1E-9 (i.e., 1bp).
		- FLOAT (e.g. 0.50)
 
	-r	Require that the fraction overlap be reciprocal for A and B.
		- In other words, if -f is 0.90 and -r is used, this requires
		  that B overlap 90% of A and A _also_ overlaps 90% of B.
 
	-s	Require same strandedness.  That is, only report hits in B
		that overlap A on the _same_ strand.
		- By default, overlaps are reported without respect to strand.
 
	-S	Require different strandedness.  That is, only report hits in B
		that overlap A on the _opposite_ strand.
		- By default, overlaps are reported without respect to strand.
 
	-split	Treat "split" BAM or BED12 entries as distinct BED intervals.
 
	-sorted	Use the "chromsweep" algorithm for sorted (-k1,1 -k2,2n) input
 
	-header	Print the header from the A file prior to results.
 
Notes: 
	(1) When a BAM file is used for the A file, the alignment is retained if overlaps exist,
	and exlcuded if an overlap cannot be found.  If multiple overlaps exist, they are not
	reported, as we are only testing for one or more overlaps.

# keep only chr21_nogap-nocvg overlapping refGene exons
refgene=ref/hg19_refGene_bed6.bed
 
infolder=Final_Results/bedtools_genomeCvg
outfolder=bedtools_genomeCvg
mkdir -p ${outfolder}
 
infile=chr21_nogap-nocvg.bed
outfile=chr21_nogap-nocvg_exonic.bed
 
bedtools intersect -wa -a ${infolder}/${infile} -b ${refgene} | \
	sort -k 1V,1 -k 2n,2 -k 3n,3 | uniq > \
	${outfolder}/${outfile}
 
head -5 ${infolder}/${outfile} | column -t

chr21  9825829  9825862  no-coverage  1
chr21  9825896  9826107  no-coverage  1
chr21  9826200  9826976  no-coverage  1
chr21  9900995  9912138  no-coverage  1
chr21  9912213  9919835  no-coverage  1

add 200 bps at either ends

bedtools slop will take the file obtained above and add 200bps to both ends to design PCR primers outside of the missing regions

bedtools slop command details

Tool:    bedtools slop (aka slopBed)
Version: v2.17.0-111-gab7bbbe
Summary: Add requested base pairs of "slop" to each feature.
 
Usage:   bedtools slop [OPTIONS] -i <bed/gff/vcf> -g <genome> [-b <int> or (-l and -r)]
 
Options: 
	-b	Increase the BED/GFF/VCF entry -b base pairs in each direction.
		- (Integer) or (Float, e.g. 0.1) if used with -pct.
 
	-l	The number of base pairs to subtract from the start coordinate.
		- (Integer) or (Float, e.g. 0.1) if used with -pct.
 
	-r	The number of base pairs to add to the end coordinate.
		- (Integer) or (Float, e.g. 0.1) if used with -pct.
 
	-s	Define -l and -r based on strand.
		E.g. if used, -l 500 for a negative-stranded feature, 
		it will add 500 bp downstream.  Default = false.
 
	-pct	Define -l and -r as a fraction of the feature's length.
		E.g. if used on a 1000bp feature, -l 0.50, 
		will add 500 bp "upstream".  Default = false.
 
	-header	Print the header from the input file prior to results.
 
Notes: 
	(1)  Starts will be set to 0 if options would force it below 0.
	(2)  Ends will be set to the chromosome length if  requested slop would
	force it above the max chrom length.
	(3)  The genome file should tab delimited and structured as follows:
 
	<chromName><TAB><chromSize>
 
	For example, Human (hg19):
	chr1	249250621
	chr2	243199373
	...
	chr18_gl000207_random	4262
 
Tips: 
	One can use the UCSC Genome Browser's MySQL database to extract
	chromosome sizes. For example, H. sapiens:
 
	mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
	"select chrom, size from hg19.chromInfo"  > hg19.genome

# add 200bps either ends of the original bed file and remove duplicates
genome=ref/hg19.genome
 
infolder=Final_Results/bedtools_genomeCvg
outfolder=bedtools_genomeCvg
mkdir -p ${outfolder}
 
infile=chr21_nogap-nocvg_exonic.bed
outfile=chr21_nogap-nocvg_exonic_slop200.bed
 
bedtools slop -b 200 -i ${infolder}/${infile} -g ${genome} > \
	${outfolder}/${outfile}
 
head -5 ${infolder}/${outfile} | column -t

chr21  9825629  9826062  no-coverage  1
chr21  9825696  9826307  no-coverage  1
chr21  9826000  9827176  no-coverage  1
chr21  9900795  9912338  no-coverage  1
chr21  9912013  9920035  no-coverage  1

# count slop'ped intervals without sequence overlapping chr21 exons
wc -l ${infolder}/${outfile}
70 chr21_nogap-nocvg_exonic_slop200.bed

Of the 1693 no-coverage regions, only 70 overlap at least one exonic nucleotide.

Note that these two sequences are overlapping! we should consider merging chr21_nogap-nocvg_exonic_slop200.bed overlapping lines before extracting the fasta sequence

merge overlapping fragments

bedtools merge will collapse overlapping records into one

bedtools merge command details

Tool:    bedtools merge (aka mergeBed)
Version: v2.17.0-111-gab7bbbe
Summary: Merges overlapping BED/GFF/VCF entries into a single interval.
 
Usage:   bedtools merge [OPTIONS] -i <bed/gff/vcf>
 
Options: 
	-s	Force strandedness.  That is, only merge features
		that are the same strand.
		- By default, merging is done without respect to strand.
 
	-n	Report the number of BED entries that were merged.
		- Note: "1" is reported if no merging occurred.
 
	-d	Maximum distance between features allowed for features
		to be merged.
		- Def. 0. That is, overlapping & book-ended features are merged.
		- (INTEGER)
 
	-nms	Report the names of the merged features separated by commas.
		Change delim. with -delim.
 
	-scores	Report the scores of the merged features. Specify one of 
		the following options for reporting scores:
		  sum, min, max,
		  mean, median, mode, antimode,
		  collapse (i.e., print a semicolon-separated list),
		- (INTEGER)
 
	-delim	Specify a custom delimiter for the -nms and -scores concat options
		- Example: -delim "|"
		- Default: ",".
 
Notes: 
	(1) All output, regardless of input type (e.g., GFF or VCF)
	    will in BED format with zero-based starts
 
	(2) The input file (-i) file must be sorted by chrom, then start.

# merge overlapping fragments
infolder=Final_Results/bedtools_genomeCvg
outfolder=bedtools_genomeCvg
mkdir -p ${outfolder}
 
infile=chr21_nogap-nocvg_exonic_slop200.bed
outfile=chr21_nogap-nocvg_exonic_slop200_merged.bed
 
bedtools merge -i ${infolder}/${infile} > ${outfolder}/${outfile}  
 
# count slop'ped intervals without sequence after merging
wc -l ${infolder}/${outfile}
59 bedtools_genomeCvg/chr21_nogap-nocvg_exonic_slop200_merged.bed

From 70 we are now left with 59 gapped regions

extract fasta for the obtained fragments

bedtools getfasta will store the hg19 DNA sequence of each region defined above into a multifasta file

bedtools getfasta command details

Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.17.0-111-gab7bbbe
Summary: Extract DNA sequences into a fasta file based on feature coordinates.
 
Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta> 
 
Options: 
	-fi	Input FASTA file
	-bed	BED/GFF/VCF file of ranges to extract from -fi
	-fo	Output file (can be FASTA or TAB-delimited)
	-name	Use the name field for the FASTA header
	-split	given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
	-tab	Write output in TAB delimited format.
		- Default is FASTA format.
 
	-s	Force strandedness. If the feature occupies the antisense,
		strand, the sequence will be reverse complemented.
		- By default, strand information is ignored.

We do twice the extraction here.

first from all regions including overlapping ones (creating partial duplicates).

# extract the hg19 merged sequences (some overlap!)
 
infolder=Final_Results/bedtools_genomeCvg
outfolder=bedtools_genomeCvg
mkdir -p ${outfolder}
 
infile=chr21_nogap-nocvg_exonic_slop200.bed
infasta=ref/HiSeq_UCSC_hg19.fa
outfasta=chr21_nogap-nocvg_exonic_slop200.fasta
 
bedtools getfasta -fi ${infasta} -bed ${infolder}/${infile} -fo ${outfolder}/${outfasta}  
 
# inspect the first two sequences
head -4 ${infolder}/${outfasta}  | sed -e "s/.\{60\}/&\n/g"

>chr21:9825629-9826062
TGGCCCCGCCTGGGACCGAACCCGGCACCGCCTCGTGGGGCGCCGCCGCCGGCCACTGAT
CGGCCCGGCGTCCGCGTCCCCCGGCGCGCGCCTTGGGGACCGGGTTGGTGGCGCCCCGCG
TGGGGCCCGGTGGGCTTCCCGGAGGGTTCCGGGGGTCGGCCTGCGGCGCGTGCGGGGGAG
GAGACGGTTCCGGGGGACCGGCCGCGACTGCGGCGGCGGTGGTGGGGGGAGCCGCGGGGA
TCGCCGAGGGCCGGTCGGCCGCCCCGGGTGCCGCGCGGTGCCGCCGGCGGCGGTGAGGCC
CCGCGCGTGTGTCCCGGCTGCGGTCGGCCGCGCTCGAGGGGTCCCCGTGGCGTCCCCTTC
CCCGCCGGCCGCCTTTCTCGCGCCTTCCCCGTCGCCCCGGCCTCGCCCGTGGTCTCTCGT
CTTCTCCCGGCCC
>chr21:9825696-9826307
GCGTCCGCGTCCCCCGGCGCGCGCCTTGGGGACCGGGTTGGTGGCGCCCCGCGTGGGGCC
CGGTGGGCTTCCCGGAGGGTTCCGGGGGTCGGCCTGCGGCGCGTGCGGGGGAGGAGACGG
TTCCGGGGGACCGGCCGCGACTGCGGCGGCGGTGGTGGGGGGAGCCGCGGGGATCGCCGA
GGGCCGGTCGGCCGCCCCGGGTGCCGCGCGGTGCCGCCGGCGGCGGTGAGGCCCCGCGCG
TGTGTCCCGGCTGCGGTCGGCCGCGCTCGAGGGGTCCCCGTGGCGTCCCCTTCCCCGCCG
GCCGCCTTTCTCGCGCCTTCCCCGTCGCCCCGGCCTCGCCCGTGGTCTCTCGTCTTCTCC
CGGCCCGCTCTTCCGAACCGGGTCGGCGCGTCCCCCGGGTGCGCCTCGCTTCCCGGGCCT
GCCGCGGCCCTTCCCCGAGGCGTCCGTCCCGGGCGTCGGCGTCGGGGAGAGCCCGTCCTC
CCCGCGTGGCGTCGCCCCGTTCGGCGCGCGCGTGCGCCCGAGCGCGGCCCGGTGGTCCCT
CCCGGACAGGCGTTCGTGCGACGTGTGGCGTGGGTCGACCTCCGCCTTGCCGGTCGCTCG
CCCTTTCCCCG

Look carefully, these two regions do overlap!

and now after collapsing overlapping regions to the largest overlap.

# extract the hg19 merged sequences (less but much longer)
infolder=bedtools_genomeCvg
infile=chr21_nogap-nocvg_exonic_slop200.bed
infile=chr21_nogap-nocvg_exonic_slop200_merged.bed
infasta=ref/HiSeq_UCSC_hg19.fa
outfasta=chr21_nogap-nocvg_exonic_slop200_merged.fasta
 
bedtools getfasta -fi ${infasta} -bed ${infolder}/${infile} -fo ${infolder}/${outfasta}  
 
# inspect the first two sequences
head -4 ${infolder}/${outfasta}  | sed -e "s/.\{60\}/&\n/g"

>chr21:9825629-9827176
TGGCCCCGCCTGGGACCGAACCCGGCACCGCCTCGTGGGGCGCCGCCGCCGGCCACTGAT
CGGCCCGGCGTCCGCGTCCCCCGGCGCGCGCCTTGGGGACCGGGTTGGTGGCGCCCCGCG
TGGGGCCCGGTGGGCTTCCCGGAGGGTTCCGGGGGTCGGCCTGCGGCGCGTGCGGGGGAG
...
GGTTGATCCTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTAC
GCACGGCCGGTACAGTGAAACTGCGAATGGCTCATTAAATCAGTTATGGTTCCTTTGGTC
GCTCGCTCCTCTCCTACTTGGATAACTGTGGTAATTCTAGAGCTAAT
>chr21:9900795-9920035
AAAGTTTTCTTAAATATGAGACCAGCAACAAGAATAAAAATTGATGAATTGGACTTCATC
AAAATTAAACATTTTTGCACTTCAAAGGACACCATCAAGAAAGTGAAAAGACAACTCACA
AGATGGAAGAAAATACTTGTAAATCATGTGTAGCACTTTTTGATGATAAACTTGAGGAAG
...
TCCACTCTGACCTGACCTTCAGTAGTTCCTTTATTTTTATTTTCAAGCATTTTTGTGAGT
ATATAATAAGCATATATATTTATGGGGTGCATGAGATGTTTTGATACAGGCATGCAGCAT
GAACTAATCACATCATGGAGAGTGGGTATCCATCCCTTTA

BED files created during this exercise can be visualized in IGV to spot these regions (we provide them from the bottom link

keep only refGene exonic CODING fragments

bedtools intersect will take the file reporting no-coverage regions not in gaps and extract from it all fragments overlapping RefGene CDS-exons taken from another BED file. This allows focusing on the CODING & exomic part of the un-sequenced genome

# keep only chr21_nogap-nocvg overlapping refGene CDS-exons
refgene=ref/merged_hg19_chr21-refGene-CDS.bed
infile=chr21_nogap-nocvg.bed
infolder=bedtools_genomeCvg
outfile=chr21_nogap-nocvg_CDS-exonic.bed
 
bedtools intersect -wa -a ${infolder}/${infile} -b ${refgene} | \
	sort -k 1V,1 -k 2n,2 -k 3n,3 | uniq > \
	${infolder}/${outfile}
 
# count them
wc -l ${infolder}/${outfile}

22 bedtools_genomeCvg/chr21_nogap-nocvg_CDS-exonic.bed

# and show some
head -5 ${infolder}/${outfile} | column -t

chr21  31913981  31913982  no-coverage  1
chr21  33245961  33246026  no-coverage  1
chr21  34399827  34399850  no-coverage  1
chr21  34399902  34400050  no-coverage  1
chr21  34442935  34442953  no-coverage  1

add 200 bps at either ends

bedtools slop will take the file obtained above and add 200bps to both ends to design PCR primers outside of the missing regions

# add 200bps either ends of the original bed file and remove duplicates
genome=ref/hg19.genome
infile=chr21_nogap-nocvg_CDS-exonic.bed
infolder=bedtools_genomeCvg
outfile=chr21_nogap-nocvg_CDS-exonic_slop200.bed
 
bedtools slop -b 200 -i ${infolder}/${infile} -g ${genome} > \
	${infolder}/${outfile}
 
head -5 ${infolder}/${outfile} | column -t

chr21  31913781  31914182  no-coverage  1
chr21  33245761  33246226  no-coverage  1
chr21  34399627  34400050  no-coverage  1
chr21  34399702  34400250  no-coverage  1
chr21  34442735  34443153  no-coverage  1

How many of the 22 remain after merging overlaps?

bedtools merge -i bedtools_genomeCvg/chr21_nogap-nocvg_CDS-exonic_slop200.bed | wc -l

This is reasonable, lets take those for further processing.

extract fasta for the obtained fragments

bedtools getfasta will store the hg19 DNA sequence of each region defined above into a multifasta file

# extract the hg19 merged sequences (some overlap!)
infolder=bedtools_genomeCvg
infile=chr21_nogap-nocvg_CDS-exonic_slop200.bed
infasta=ref/HiSeq_UCSC_hg19.fa
outfasta=chr21_nogap-nocvg_CDS-exonic_slop200.fasta
 
bedtools getfasta -fi ${infasta} -bed ${infolder}/${infile} -fo ${infolder}/${outfasta}  
 
# inspect the first two sequences
head -4 ${infolder}/${outfasta}  | sed -e "s/.\{60\}/&\n/g"

>chr21:31913781-31914182
GACTAAATAATAATAGGTGAAACCCTGATTCAATGTAATGATTCCAGATTCTTCCATAAA
AACTGATGGCTAGTTCCTTCTGGAGCATGAAGCCAAAAAATGGTAGGTATTTTCTCTACA
TTGAGAAAAGTGTTCGCCTTGCTGAAGCATACGGGCAAGATCTTGGAGTTAGAGTATGAG
AATCAGCAAGTTTTTTAGTAGAATCCAGAGAATCCATATCCTTCACGGCATGATGGGCGG
CAGCAGCCATATCTATAGCCTCCATAGCCAGAGCCATATCTGTAGCCTCCACAGCCACAG
CCATAGCCCAGACCACCAAAGCCTCCACAGCCATATCCCAGGCCTCTGTAGTAGCTGCCA
TAGTATCTCATGGTGTCAGGGACTAGAGATTCAGTTCAATG
>chr21:33245761-33246226
GCGGGGTCCCCGGTCCCGCGTCCCCTGGGCAGCCGCTATTGTCTACGCGCCTCGCTGGGC
GGCGCGGGGGGCGTGATCGCGGCGGCCCCGGGCTCTGGGTGCGGAGACCCAGGCGGGGCT
GGGCCCAGGGCGGCGGCGGGAGAAGCCGGGGAAGCCGAAGAGCCTGGGGAGGAGGAGCTG
CGAGCGCGGGAGACGAGCAGGAGCCGCGCGGGCCGCGGCGAGCGCGATGCCGGCGGCGGC
GGGGGACGGGCTCCTGGGGGAGCCGGCGGCGCCTGGGGGCGGCGGCGGCGCGGAGGACGC
GGCCAGGCCCGCGGCGGCCTGCGAGGGAAGTTTCCTGCCTGCCTGGGTGAGCGGCGTGCC
CCGCGAGCGGCTCCGCGACTTCCAGCACCACAAGCGCGTGGGCAACTACCTCATCGGCAG
CAGGAAGCTGGGCGAGGGCTCCTTTGCCAAGGTGCGCGAGGGGCT

try the flank command and have a look in IGV

bedtools flank is also worth mentioning here to extract the borders of a given bed list; It can be used for both sides or for one side and for a defined length or the a fraction of the length as the query fragment. (eg: flank can be useful to extract boundaries and design primers outside of a set of regions to be re-sequenced)

bedtools flank command details

Tool:    bedtools flank (aka flankBed)
Version: v2.17.0-111-gab7bbbe
Summary: Creates flanking interval(s) for each BED/GFF/VCF feature.
 
Usage:   bedtools flank [OPTIONS] -i <bed/gff/vcf> -g <genome> [-b <int> or (-l and -r)]
 
Options: 
	-b	Create flanking interval(s) using -b base pairs in each direction.
		- (Integer) or (Float, e.g. 0.1) if used with -pct.
 
	-l	The number of base pairs that a flank should start from
		orig. start coordinate.
		- (Integer) or (Float, e.g. 0.1) if used with -pct.
 
	-r	The number of base pairs that a flank should end from
		orig. end coordinate.
		- (Integer) or (Float, e.g. 0.1) if used with -pct.
 
	-s	Define -l and -r based on strand.
		E.g. if used, -l 500 for a negative-stranded feature, 
		it will start the flank 500 bp downstream.  Default = false.
 
	-pct	Define -l and -r as a fraction of the feature's length.
		E.g. if used on a 1000bp feature, -l 0.50, 
		will add 500 bp "upstream".  Default = false.
 
	-header	Print the header from the input file prior to results.
 
Notes: 
	(1)  Starts will be set to 0 if options would force it below 0.
	(2)  Ends will be set to the chromosome length if requested flank would
	force it above the max chrom length.
	(3)  In contrast to slop, which _extends_ intervals, bedtools flank
	creates new intervals from the regions just up- and down-stream
	of your existing intervals.
	(4)  The genome file should tab delimited and structured as follows:
 
	<chromName><TAB><chromSize>
 
	For example, Human (hg19):
	chr1	249250621
	chr2	243199373
	...
	chr18_gl000207_random	4262
 
Tips: 
	One can use the UCSC Genome Browser's MySQL database to extract
	chromosome sizes. For example, H. sapiens:
 
	mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
	"select chrom, size from hg19.chromInfo"  > hg19.genome

######## bonus bedtools flank command ######################
# each line is one flank (left then right) respective to the initial segment
infolder=bedtools_genomeCvg
infile=chr21_nogap-nocvg_exonic.bed
outfile=chr21_nogap-nocvg_exonic_flank200.bed
 
bedtools flank -i ${infolder}/${infile} -g ${genome} -b 200 | \
	sort -k 1V,1 -k 2n,2 -k 3n,3 | uniq > \
	${infolder}/${outfile}
 
head -5 ${infolder}/${outfile} | column -t

chr21  9825629  9825829  no-coverage  1
chr21  9825696  9825896  no-coverage  1
chr21  9825862  9826062  no-coverage  1
chr21  9826000  9826200  no-coverage  1
chr21  9826107  9826307  no-coverage  1

In this last example (HUNK gene), the flanks are shown in exomic pairs in the lower track but only one of the pairs overlaps the coding part and is therefore used for the slop CDS fragment in the middle track.

Zooming in (chr21:33,245,874-33,246,120) to show the missing CDS

use data.bits URLs directly in IGV (no download required)

All BED files created during this exercise can be visualized in IGV to spot these regions and can even be loaded DIRECTLY within IGV from the following data.bits URLs

list of URL for IGV

Copy one of these links and paste it in IGV 'File>Load from URL'

http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_exonic.bed
http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_exonic_slop200.bed
http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_exonic_slop200_merged.bed
http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_CDS-exonic.bed
http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_CDS-exonic_slop200.bed
http://data.bits.vib.be/pub/trainingen/NGSDNA2013/ex4b-files/chr21_nogap-nocvg_exonic_flank200.bed

Please read the doc to discover other bedtools, also become familiar with IGV, the combination is dynamite

download exercise files

Download exercise files here

Use the right application to open the files present in ex4b-files

References:

NGS Exercise.4b

Contents

introduction

convert the UCSC hg19 refGene BED12 transcript track to a BED6 exon track

download the list of refGene hg19 coding exons (CDS)

retrieve the reference sequence of un-sequenced exomic regions of chr21 (from BWA mem)

exclude fragments without reference sequence (Nregions=gaps)

keep only refGene exonic fragments

add 200 bps at either ends

merge overlapping fragments

extract fasta for the obtained fragments

keep only refGene exonic CODING fragments

add 200 bps at either ends

extract fasta for the obtained fragments

try the flank command and have a look in IGV

use data.bits URLs directly in IGV (no download required)

download exercise files

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox