NGS variant analysis custom functions and code
[ back to Hands-on introduction to NGS variant analysis ]
Custom functions and pieces of code assembled to make your life easier when it comes to perform repetitive tasks. Information will be given during the course to use this efficiently. custom functions are edited and added to a **.myfunctions** file created in the home folder and sourced at startup in your **.bashrc** (aka .profile ...) to be always available in bash. Type **declare -F** to see all functions available on your machine (try **listfunc** too). In order to use custom functions within bash scripts, you will need to source the .myfunction file at the beginning of your script with '. $HOME/.myfunctions'.
Custom functions can take arguments just like shell scripts and can really be complex and very useful.
Contents
- 1 Picard .dict to .genome (Bedtools)
- 2 splitline
- 3 numcols
- 4 addchr | remchr
- 5 vcf2index
- 6 vcf2mnp-indels.sh
- 7 split VCF by type
- 8 Creating colorfull VENN plots in R
- 9 filter unpaired reads from a - read-name sorted - BAM file (bam_re-pair.pl)
- 10 Creating a 10% subset from a BAM file with **samtools**
- 11 Get chromosome length from UCSC for the human genome
Picard .dict to .genome (Bedtools)
Takes a .dict file created by Picard and makes a text chromosome name + length file required for Bedtools
function dict2genome { # create .genome file from PICARD .dict file echo "# creating ${1%\.dict}.genome" grep "^@SQ" ${1} | awk '{split($2,name,":"); split($3,len,":"); print name[2]"\t"len[2]}' > ${1%\.dict}.genome }
splitline
Split lines after a given length to better render them for inclusion in web-documents. The argument defines the length of the default value of 60 is used. This command can be piped after another command to format the output.
splitline () { # split lines in pipe after n char (default=60) # usage command | command | splitline 70 # the € char is used to temporarily convert tabs to 1-char length # tabs are restored at the end lim=${1:-60} tr "\t" "€" | sed -e "s/.\{${lim}\}/&\n/g" | tr "€" "\t" }
As always, there is a command called fold to 'Wrap input lines to fit in specified width' in the GNU coreutils (http://www.gnu.org/software/coreutils/).
fold invocation
Usage: fold [OPTION]... [FILE]... Wrap input lines in each FILE (standard input by default), writing to standard output. Mandatory arguments to long options are mandatory for short options too. -b, --bytes count bytes rather than columns -s, --spaces break at spaces -w, --width=WIDTH use WIDTH columns instead of 80 --help display this help and exit --version output version information and exit type "info coreutils 'fold invocation'" for more examples
numcols
Take a line (1st by default) of a table and output columns as lines with leading number
function numcols () { head -${2:-1} ${1} | tail -1 | transpose -t | awk '{print NR,$0}' }
addchr | remchr
A function to add / remove 'chr' in chromosome names (hg19 vs B37 builds) from **VCF** or **BED** files (provided as unique argument)
function addchr() { # add chr in front of line starting with 0-9XYM # should work with most BED and VCF data if [ "$#" -lt 1 ] then cat | gawk 'BEGIN{FS="\t"; OFS="\t"} {if ($1~/^MT/){ gsub("MT","chrM",$1); print $0 } else if ($1~/^[0-9XYM]/) {gsub($1,"chr"$1,$1); print $0 } else print $0 }' else gawk 'BEGIN{FS="\t"; OFS="\t"} {if ($1~/^MT/){ gsub("MT","chrM",$1); print $0 } else if ($1~/^[0-9XYM]/) {gsub($1,"chr"$1,$1); print $0 } else print $0 }' $1 fi }
function remchr() { # removes 'chr" from front of lines starting with chr # should work with BED and VCF if [ "$#" -lt 1 ] then cat | gawk 'BEGIN{FS="\t"; OFS="\t"} {if ($1~/^chrM/){ gsub("chrM","MT",$1); print $0 } else if ($1~/^chr/){ gsub("chr","",$1); print $0 } else print $0 }' else gawk 'BEGIN{FS="\t"; OFS="\t"} {if ($1~/^chrM/){ gsub("chrM","MT",$1); print $0 } else if ($1~/^chr/){ gsub("chr","",$1); print $0 } else print $0 }' $1 fi }
vcf2index
This custom bash function performs a frequent operation consisting in sorting by chromosome and position, compressing, and indexing a vcf file.
function vcf2index () { # accepts both piped content or a vcf file as input # writes header through and sorts remaining lines # compresses with bgzip, indexes with tabix # and cleans up temp file (piped version) if [ $# -eq 1 ]; then title=$(basename "$1" ".vcf"); ( grep ^"#" $1 | perl -pi -e "s/$title.*$/$title/"; grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 ) | \ bgzip -c > $1".gz" && tabix -p vcf $1".gz"; else cat > .tmpf; outfile="sorted-data.vcf.gz"; ( grep ^"#" .tmpf; grep -v ^"#" .tmpf | sort -k 1V,1 -k 2n,2 ) | \ bgzip -c > ${outfile} && ( rm .tmpf; tabix -p vcf ${outfile} ); fi }
Alternatively, you can type the following bash command to mimic it in terminal for each new VCF file you need to process ...
grep ^"#" file.vcf | grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 | bgzip -c > \ file.vcf.gz && tabix -p vcf file.vcf.gz"
vcf2mnp-indels.sh
This custom script splits a compressed vcf-file into MNP variants and indels as these variants are often analyzed separately.
#!/bin/bash # split vcf file into MNP and indel fractions # SP-BITS: added '.' to allowed one nucleotide ins or del call # awk code from: Pierre Lindenbaum # http://biostar.stackexchange.com/questions/7426/ # how-to-separate-snp-variants-from-indel-variants-in-the-same-vcf-file # # Stéphane Plaisance - VIB-BITS - May-07-2013 v1.0 # check parameters if [[ $# != 1 ]] then echo "Usage: ${0##*/} variants.vcf.gz"; exit 1 fi pref=$(basename $1 ".vcf.gz") gzip -cd $1 | gawk -v pref=${pref} ' /.*/ { # wave header to both files if ($0~/^#/) { print $0 > pref"-snv.vcf"; print $0 > pref"-indels.vcf"; # both call fields are one nucleotide long or '.' } else if ($0~/^[^\t]+\t[0-9]+\t[^\t]*\t[atgcATGC\.]\t[a-zA-Z\.]\t/) { print $0 > pref"-snv.vcf"; # one+ call is longer than 1 base } else { print $0 > pref"-indels.vcf"; } }' # compress results with bgzip and index with tabix bgzip ${pref}"-snv.vcf" && tabix -p vcf ${pref}"-snv.vcf.gz" bgzip ${pref}"-indels.vcf" && tabix -p vcf ${pref}"-indels.vcf.gz"
A couple of recent vcftools additional commands can be substituted to this script. Please follow <http://vcftools.sourceforge.net/options.html> for more vcftools syntax information. Include or exclude sites that contain an indel. For this option 'indel' means any variant that alters the length of the REF allele.
split VCF by type
# split VCF in SNVs and Indels with vcftools (recent version only!) # extraction adds .recode to the name, the second command removes it # by using a powerful bash string manipulation operation infile=some-file.vcf.gz vcftools --gzvcf ${infile} --remove-indels --out SNPs_${infile} --recode && \ mv SNPs_${infile} SNPs_${infile/\.recode/} && \ vcf2index SNPs_${infile} vcftools --gzvcf file.vcf.gz --keep-only-indels --out INDELs_file --recode && \ mv INDELs_${infile} INDELs_${infile/\.recode/} \ vcf2index INDELs_${infile}
Creating colorfull VENN plots in R
A recent package allows creating venn plots from the list of counts in the distinct regions. Such lists are obtained when running vcf-compare and can easily be fed to R to plot the results.
The complete documentation for that package can be reached on CRAN <http://cran.r-project.org/web/packages/colorfulVennPlot/>
Scripts are provided to create 2D, 3D, and 4D plots on our Github repository https://github.com/BITS-VIB/venn-tools[1]. Users can run these scripts (made executable) and specify each Venn count and additional parameters to obtain grey or color plots to insert in their reports. A typical call for a 4-way plot would be:
4DVenn.R -a 1 -b 2 -c 3 -d 4 -e 5 -f 6 -G 7 -i 8 -j 9 \ -k 10 -l 11 -m 12 -n 13 -p 14 -q 15 -f "4way-Venn"\ -A "grA" -B "grB" -C "grC" -D "grD" \ -o "my_4wayVenn" -u 1 -x 1
4Dvenn.R script
#!/usr/bin/RScript # plots a 4D VENN from precomputed data # usage: 4Dvenn.R -a .. to .. -u (see below) # counts are expected in the order # "0100","0010","0110",1100", # "0011","1000","1110","0111", # "0001","1111","1010","0101", # "1001" # the order is arbitrary, from top to bottom and from left to right # extra arguments: A-label B-label C-label D-label Title (opt) # example command: # 4DVenn.R -a 1 -b 2 -c 3 -d 4 -e 5 -f 6 -G 7 -i 8 -j 9 # -k 10 -l 11 -m 12 -n 13 -p 14 -q 15 # -A "grA" -B "grB" -C "grC" -D "grD" -t "4way-Venn" # -o "my_4wayVenn" -u 1 -x 1 # # Stephane Plaisance VIB-BITS April-11-2014 v1.0 # required R-packages # once only install.packages("grid") # once only install.packages("optparse") # once only install.packages("colorfulVennPlot") suppressPackageStartupMessages(library("optparse")) suppressPackageStartupMessages(library("grid")) suppressPackageStartupMessages(library("colorfulVennPlot")) ##################################### ### Handle COMMAND LINE arguments ### ##################################### # parameters option_list <- list( make_option(c("-a", "--a.count"), type="integer", default=0, help="counts for A-only [default: %default]"), make_option(c("-b", "--b.count"), type="integer", default=0, help="counts for B-only [default: %default]"), make_option(c("-c", "--c.count"), type="integer", default=0, help="counts for C-only [default: %default]"), make_option(c("-d", "--d.count"), type="integer", default=0, help="counts for D-only [default: %default]"), make_option(c("-e", "--ab.count"), type="integer", default=0, help="counts for AB-intersect [default: %default]"), make_option(c("-f", "--ac.count"), type="integer", default=0, help="counts for ACB-intersect [default: %default]"), make_option(c("-G", "--ad.count"), type="integer", default=0, help="counts for AD-intersect (! use G and not g, due to a bug in RScript) [default: %default]"), make_option(c("-j", "--bc.count"), type="integer", default=0, help="counts for BC-intersect [default: %default]"), make_option(c("-k", "--bd.count"), type="integer", default=0, help="counts for BD-intersect [default: %default]"), make_option(c("-l", "--cd.count"), type="integer", default=0, help="counts for CD-intersect [default: %default]"), make_option(c("-m", "--abc.count"), type="integer", default=0, help="counts for ABC-intersect [default: %default]"), make_option(c("-n", "--abd.count"), type="integer", default=0, help="counts for ABD-intersect [default: %default]"), make_option(c("-p", "--acd.count"), type="integer", default=0, help="counts for ACD-intersect [default: %default]"), make_option(c("-q", "--bcd.count"), type="integer", default=0, help="counts for BCD-intersect [default: %default]"), make_option(c("-i", "--abcd.count"), type="integer", default=0, help="counts for ABCD-intersect [default: %default]"), make_option(c("-A", "--a.label"), type="character", default="A", help="label for A [default: %default]"), make_option(c("-B", "--b.label"), type="character", default="B", help="label for B [default: %default]"), make_option(c("-C", "--c.label"), type="character", default="C", help="label for C [default: %default]"), make_option(c("-D", "--d.label"), type="character", default="D", help="label for D [default: %default]"), make_option(c("-t", "--title"), type="character", help="Graph Title [default: null]"), make_option(c("-x", "--format"), type="integer", default=1, help="file format for output 1:PNG, 2:PDF [default: %default]"), make_option(c("-o", "--file"), type="character", default="4Dvenn", help="file name for output [default: %default]"), make_option(c("-u", "--fill"), type="character", default="3", help="fill with 1:colors, 2:greys or 3:white [default: %default]") ) # PARSE OPTIONS opt <- parse_args(OptionParser(option_list=option_list)) # check if arguments provided if (length(opt) > 1) { # data y=c(opt$b.count, opt$c.count, opt$bc.count, opt$ab.count, opt$cd.count, opt$a.count, opt$abc.count, opt$bcd.count, opt$d.count, opt$abcd.count, opt$ac.count, opt$bd.count, opt$acd.count, opt$abd.count, opt$ad.count) names(y) <- c("0100","0010", "0110", "1100","0011", "1000","1110","0111","0001", "1111", "1010","0101", "1011","1101", "1001") # labels labels <- c(opt$a.label, opt$b.label, opt$c.label, opt$d.label) # colors (15) ncol <- 15 my.colors <- rainbow(ncol) my.greys <- rev(gray(0:32 / 32))[1:ncol] my.whites <- rep("#FFFFFF", ncol) my.fill <- ifelse( rep(opt$fill=="1", ncol), my.colors, (ifelse(rep(opt$fill=="2",ncol), my.greys, my.whites)) ) # title my.title <- ifelse(!is.null(opt$title), opt$title, "") # format if (opt$format==1){ # png filename <- paste(opt$file, ".png", sep="") png(file = filename, bg = "transparent") } else { # pdf filename <- paste(opt$file, ".pdf", sep="") pdf(file = filename, bg = "white") } # plot plot.new() plotVenn4d(y, labels, Colors = my.fill, Title = my.title) dev.off() }
filter unpaired reads from a - read-name sorted - BAM file (bam_re-pair.pl)
#!/usr/bin/perl -w # filter unpaired reads from a - read-name sorted - BAM file # title: bam_re-pair.pl # version: 2016-02-18 (added support for /1 or /2 after read name) # author: Stephane Plaisance (translated from python version by Devon Ryan) # http://seqanswers.com/forums/showthread.php?p=118936#post118936 # usage: # samtools view -h <name_sorted.bam> | \ # bam_re-pair.pl | \ # samtools view -bSo <name_sorted.filtered.bam> - use warnings; use strict; # variables my $read = ""; my $read1 = "none"; my $read2 = "none"; my $name1 = "none"; my $name2 = "none"; my ($ln,$ok,$no)=(0,0,0); while (my $read = <>) { # forward header lines if ($read =~ /^@/){ print STDOUT $read; next; } # process data $ln++; if( $name1 eq "none" ){ $read1 = $read; $name1 = (split("\t", $read1))[0]; } else { $name2 = (split("\t", $read))[0]; if( $name1 eq $name2 ){ # is paired $ok++; # add index to read names if absent if ($name1 !~ /\/1$/){ $read1 =~ s/$name1/$name1\/1/; } if ($name2 !~ /\/2$/){ $read =~ s/$name2/$name2\/2/; } print STDOUT sprintf("%s%s", $read1, $read); $read1 = "none"; $name1 = "none"; } else { # is not paired $no++; $read1 = $read; $name1 = (split("\t", $read1))[0]; } } } # report counts with nice alignmenment print STDERR "\n############################\n# Results\n"; print STDERR sprintf("%-18s%10d\n", "# processed:", $ln); print STDERR sprintf("%-18s%10d\n", "# passed-pairs:", $ok); print STDERR sprintf("%-18s%10d\n", "# failed-reads:", $no); exit 0;
The pieces of code below are alternative ways of getting data that were removed from the main text and are assembled here as a source of inspiration for the users. Beware that some command may seem to lead to the same result but in fact present slight functional differences.
Creating a 10% subset from a BAM file with **samtools**
#! /usr/bin/env bash # samtools Version: 0.1.19+ data=data-folder bamfile=mappings result=bam_subset mkdir -p ${result} # '0.1' will keep 10% of the bam records samtools view -s 0.1 -b \ ${data}/${bamfile}.bam \ > ${data}/samtools-sample_${bamfile}.bam \ 2>${data}/samtools-sample.err # bam data is often better sorted and indexed samtools sort ${data}/samtools-sample_${bamfile}.bam \ ${result}/srt_samtools-sample_${bamfile} && \ samtools index ${result}/srt_samtools-sample_${bamfile}.bam
Get chromosome length from UCSC for the human genome
# hg19 (similar to GRCh37) mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo order by chrom" > hg19.chromSizes.txt # GRCh38 => hg38 and not hg20 as one could expect (yes we can!) mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg38.chromInfo order by chrom" > hg38.chromSizes.txt
[ back to Hands-on introduction to NGS variant analysis ]