Create a GC content track
[ Main_Page | NGS_data_analysis ]
Create a custom 'GC'% track for your reference genome
(A hg19 GC track can be loaded from the IGV server but only for a 5bps averaging window )
Contents
- 1 Aim - create a custom GC percent genome wide track
- 2 Method
- 2.1 Download the yeast reference genome data from the UCSC table repository
- 2.2 create a N-base wide step file from the chromosome list with Bedtools makewindows
- 2.3 compute base frequencies across all bins with BedTools nuc
- 2.4 extract GC% results to a new IGV-formatted file for visualization
- 2.5 create a binary version with IGVtools for fast streaming
- 3 Results as visualized in IGV
- 4 Conclusion
Aim - create a custom GC percent genome wide track
Local excess of GC or AT bases will result in altered pairing of the DNA molecule and introduce bias in analysis or experimental setup that rely on DNA hybridozation (oligo annealing, denaturation, alignement). It can be useful to identify local unbalance in base composition or display it in a genome context. The short howto creates a GC% track for th full human genome (hg19) but can be used as guide for any other genome for which you know the sequence.
this page relates to Create a mappability track
Method
Download the yeast reference genome data from the UCSC table repository
- UCSC data can be obtained from their FTP server http://hgdownload.soe.ucsc.edu/goldenPath/hg19[1]
- other files are built from the table browser with interaction in your browser
We use below Kent tools (unix version from [2], also available for macOSX[3]) used by the UCSC team to produce the Genome browser system. Some of these tools should be present on your machine in order to repeat the code below.
If you do not have Kent tools installed, or if your reference genome is absent from the UCSC repository, You can download it in Fasta format. This means that you can apply this method to contigs or BACs provided you have their sequence - if you choose to download the fasta sequence, please proceed to the next section
# create a folder to store all results basefolder=$HOME"/Desktop/gcpercent" mkdir –p ${basefolder} # download the 2bit file wget -P ${basefolder} http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit # extract the fasta data with Kent's 'twoBitToFa' twoBitToFa ${basefolder}/hg19.2bit ${basefolder}/hg19.fa # inspect fasta headers grep "^>" ${basefolder}/hg19.fa
>chr1 >chr2 >chr3 >chr4 ... >chrUn_gl000231 >chrUn_gl000229 >chrM >chrUn_gl000226 >chr18_gl000207_random
# also create a two-column list of chromosomes names and sizes with Kent's 'faSize' faSize ${basefolder}/hg19.fa -detailed > ${basefolder}/hg19.sizes
alternate method to get sequence sizes without Kent tools
# generate a fasta index (including sizes in second column) samtools faidx ${basefolder}/hg19.fa # extract sequence names (chromosomes) and their lengths to a new file cut -f 1,2 ${basefolder}/hg19.fa.fai > ${basefolder}/hg19.sizes
create a N-base wide step file from the chromosome list with Bedtools makewindows
The scanning method chosen here uses a stepping window but you can also generate sliding windows using Bedtools [4] (read the DOC[5]).
# create a folder to store all results basefolder=$HOME"/Desktop/gcpercent" mkdir –p ${basefolder} # make 1kb wide bins width=1000 # create a BED file with windows of wished width bedtools makewindows \ -g ${basefolder}/hg19.sizes \ -w ${width} \ > ${basefolder}/hg19_${width}bps.bed
compute base frequencies across all bins with BedTools nuc
The following information will be reported after each BED entry by bedtools nuc
command help page
Tool: bedtools nuc (aka nucBed) Version: v2.19.1 Summary: Profiles the nucleotide content of intervals in a fasta file. Usage: bedtools nuc [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> Options: -fi Input FASTA file -bed BED/GFF/VCF file of ranges to extract from -fi -s Profile the sequence according to strand. -seq Print the extracted sequence -pattern Report the number of times a user-defined sequence is observed (case-sensitive). -C Ignore case when matching -pattern. By defaulty, case matters. -fullHeader Use full fasta header. - By default, only the word before the first space or tab is used. Output format: The following information will be reported after each BED entry: 1) %AT content 2) %GC content 3) Number of As observed 4) Number of Cs observed 5) Number of Gs observed 6) Number of Ts observed 7) Number of Ns observed 8) Number of other bases observed 9) The length of the explored sequence/interval. 10) The seq. extracted from the FASTA file. (opt., if -seq is used) 11) The number of times a user's pattern was observed. (opt., if -pattern is used.)
# create a folder to store all results basefolder=$HOME"/Desktop/gcpercent" mkdir –p ${basefolder} bedtools nuc \ -fi ${basefolder}/hg19.fa \ -bed ${basefolder}/hg19_${width}bps.bed \ > ${basefolder}/hg19_nuc_${width}.txt
extract GC% results to a new IGV-formatted file for visualization
To obtain a valid .igv format, we need the classical 'chr', 'start', and 'end' columns together with a 'title' column and the GC % values from column #5 of the bedtools nuc output. Our best friend Awk takes care of providing all this in no time.
# width was defined earlier and was set to 1000, hence the title column filled with 'GCpc_1000bps' gawk -v w=${width} 'BEGIN{FS="\t"; OFS="\t"} { if (FNR>1) {print $1,$2,$3,"GCpc_"w"bps",$5} }' ${basefolder}/hg19_nuc_${width}.txt > ${basefolder}/hg19_${width}bps.igv
create a binary version with IGVtools for fast streaming
For a large reference genome and/or a small width, the IGV file will grow quite big. It is a good practice to use igvtools to derive a .tdf version thereof for rapid visualization and streaming in IGV
igvtools command help page
Command "toTDF" --------------------------------------------------------------------------- The "toTDF" command converts a sorted data input file to a binary tiled data (.tdf) file. Input file formats supported are .wig, .cn, .igv, and .gct, TCGA mage-tab files, and "list" files. List files are text files containing a list of files in one of the supported formats, one file per line. When using a list file the format of the contained files must be specified explicitly with the "fileType" parameter. List files must end with the extension ".list". File paths can be absolute or relative to the directory containing the list file. Usage: igvtools toTDF [options] [inputFile] [outputFile] [genome] Required arguments: inputFile The input file (see supported formats above). outputFile Binary output file. Must end in ".tdf". genome A genome id or filename. See details below. Default is hg18. Options: -z, --maxZoom num Specifies the maximum zoom level to precompute. The default value is 7 and is sufficient for most files. To reduce file size at the expense of IGV performance this value can be reduced. -f, --windowFunctions list A comma delimited list specifying window functions to use when reducing the data to precomputed tiles. Allowed values are min, max, mean, median, p2, p10, p90, and p98. The "p" values represent percentile, so p2=2nd percentile, etc. -p, --probeFile file Specifies a "bed" file to be used to map probe identifiers to locations. This option is useful when preprocessing gct files. The bed file should contain 4 columns: chr start end name where name is the probe name in the gct file. --fileType Explicitly specify the file type. This is a required parameter for TCGA mage-tab and ".list" files. Possible values are mage-tab, .wig, .cn, .igv, and .gct. Only mage-tab files downloaded from the TCGA data center or related sights are supported at this time. Conversion of ".gct" and "mage-tab" files results in the creation of an ".igv" file, which is sorted by genome position using the "sort" command. For this case the following optional parameters can be specified. -t, --tmpDir tmpdir Specify a temporary working directory. For large input files this directory will be used to store intermediate results of the sort. The default is the users temp directory. -m, --maxRecords number The maximum number of records to keep in memory during the sort. The default value is 500000. Increase this number if you receive "too many open files" errors. Decrease it if you experience "out of memory" errors. Example: igvtools toTDF -z 5 copyNumberFile.cn copyNumberFile.tdf hg18 Notes: Data file formats, with the exception of .gct files, must be sorted by start position. If necessary files can be sorted with the "sort" command described below. Attempting to preprocess an unsorted file will result in an error. --------------------------------------------------------------------------- See http://www.broadinstitute.org/software/igv/igvtools_commandline, or igvtools_readme.txt, for more help Done
#!/usr/bin/bash # the '-z' (--maxZoom) value decides from which zoom level data will become visible (default=7) # the '-f' allowed values are: # min, max, mean, median, p2, p10, p90, and p98. # the "p" values represent percentile, so p2=2nd percentile, igvtools toTDF \ -z 5 \ -f min,max,mean \ ${basefolder}/hg19_${width}bps.igv \ ${basefolder}/hg19_${width}bps.tdf \ hg19 # replace hg19 by your-genome.fasta
Results as visualized in IGV
A screen-shot is provided to illustrate the results using IGV and the hg19 loaded genome. The chr4 telomeric region including the Dux2 / Dux4 region is selected as being a known GC rich spot. Several visualizations are presented using histogram, line or heatmap. Use the one that best fits your needs.
Conclusion
It is obvious that many other tools can achieve this goal, we selected tools that are quite popular, easy to install and performant as a matter of execution time. Feed back on this page is welcome and can be sent to mailto:bits@vib.be
References:
- ↑ http://hgdownload.soe.ucsc.edu/goldenPath/hg19
- ↑ http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64
- ↑ http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386"
- ↑ https://github.com/arq5x/bedtools2.git
- ↑ http://bedtools.readthedocs.org/en/latest/
[ Main_Page | NGS_data_analysis ]