NGS-Var2020 Exercise.2
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.1 | NGS-Var2020 Exercise.3 ]
Align paired end reads to the human reference genome GRCh38 using the Burrow Wheeler Aligner (BWA)
Contents
Introduction
Reference mapping is the process applied to NGS reads when the reference genome is available. Mapping (aligning) reads to the reference is required in order to later pileup all alignment results and search for variants at each conflicting position. In the mapping step, each read is aligned to the reference genome and the genome coordinate of the best hit(s) is(are) stored together with the read sequence and quality parameters in a SAM/BAM file. This is the most time consuming step of NGS analysis and its quality and completeness will condition all downstream processes.
Full mapping of an average human NGS Illumina dataset (100M read pairs) will take several days and use full computer power on a 48cpu computer with 48GB RAM (values are indicative).
prepare the reference genome for BWA alignment
BWA aligns reads to a library of possible short nucleotides (hash table). A hash table is build once for each new reference genome using one of BWA commands.
This step was performed for you and the reference index saved to the GenePattern server under the name GRCh38.
Align the reads in pairs to the reference genome using the bwa mem algorithm
We will do this step using the smallest 1% sample and not the full data in order to speed up the process
- start the BWA mem module
- in the 'input' parameter group, link the reference and the two 1% read files (0.01) in the corresponding fields
- review the optional settings but do not change the defaults
use the text below to fill the text fields:
https://data.bits.vib.be/pub/trainingen/NGSVAR2020/reads/HG001.GRCh38_chr22_0.01_1.fq.gz
https://data.bits.vib.be/pub/trainingen/NGSVAR2020/reads/HG001.GRCh38_chr22_0.01_2.fq.gz
@RG ID:HG001 LB:NA12878_giab PU:unknown-0.0 PL:Illumina SM:NA12878
HG001_1pc
- run and wait for results, you should get a job as shown next
Prepare the mappings for variant calling with various Picard tools grouped in GATK.BamPreprocess
The presence of PCR duplicates in NGS libraries can cause false positive variant calls at later stages. For this reason, the duplicates present after mapping to the reference genome must be marked using the dedicated Picard tool.
You can read more about read duplicates in our [Q&A section about read duplicates]
Read Name regex: Reads are expected to be named in the following convention machine:run:flowcell:lane:tile:x_coord:y_coord
- start the GATK.BamPreprocess module
- point the tool to the SAM file obtained after BWA mapping then review and adapt the other fields as shown
- run and wait for the results
- open the 'HG001_1pc.duplicatesmetrics.txt' summary text file to get numbers
The metrics are explained HERE
Results for the 1% run
## htsjdk.samtools.metrics.StringHeader # MarkDuplicates --INPUT NA12878_0.01_1.fq.querysorted.bam --OUTPUT NA12878_0.01_1.fq.marked.bam --METRICS_FILE NA12878_0.01_1.fq.duplicatesmetrics.txt --ASSUME_SORT_ORDER queryname --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --TMP_DIR TMP --QUIET true --VALIDATION_STRINGENCY SILENT --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false ## htsjdk.samtools.metrics.StringHeader # Started on: Mon Dec 30 09:24:45 CET 2019 ## METRICS CLASS picard.sam.DuplicationMetrics LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE NA12878_01pc 2284 340118 2599 2284 34 20 0 0.000108 2891892954 ## HISTOGRAM java.lang.Double BIN CoverageMult all_sets non_optical_sets 1.0 1 340078 340078 2.0 1.999882 20 20 3.0 2.999647 0 0 4.0 3.999294 0 0 5.0 4.998824 0 0 6.0 5.998236 0 0 7.0 6.997531 0 0 8.0 7.996708 0 0 9.0 8.995767 0 0 10.0 9.994709 0 0 11.0 10.993534 0 0 12.0 11.992241 0 0 13.0 12.990831 0 0 14.0 13.989303 0 0 15.0 14.987658 0 0 16.0 15.985895 0 0 17.0 16.984015 0 0 18.0 17.982018 0 0 19.0 18.979903 0 0 20.0 19.977671 0 0 21.0 20.975322 0 0 22.0 21.972855 0 0 23.0 22.970271 0 0 24.0 23.967569 0 0 25.0 24.964751 0 0 26.0 25.961815 0 0 27.0 26.958761 0 0 28.0 27.955591 0 0 29.0 28.952303 0 0 30.0 29.948898 0 0 31.0 30.945376 0 0 32.0 31.941737 0 0 33.0 32.937981 0 0 34.0 33.934107 0 0 35.0 34.930116 0 0 36.0 35.926008 0 0 37.0 36.921783 0 0 38.0 37.917441 0 0 39.0 38.912982 0 0 40.0 39.908405 0 0 41.0 40.903712 0 0 42.0 41.898902 0 0 43.0 42.893974 0 0 44.0 43.88893 0 0 45.0 44.883768 0 0 46.0 45.87849 0 0 47.0 46.873094 0 0 48.0 47.867582 0 0 49.0 48.861952 0 0 50.0 49.856206 0 0 51.0 50.850343 0 0 52.0 51.844362 0 0 53.0 52.838265 0 0 54.0 53.832051 0 0 55.0 54.82572 0 0 56.0 55.819273 0 0 57.0 56.812708 0 0 58.0 57.806027 0 0 59.0 58.799229 0 0 60.0 59.792314 0 0 61.0 60.785282 0 0 62.0 61.778133 0 0 63.0 62.770868 0 0 64.0 63.763486 0 0 65.0 64.755987 0 0 66.0 65.748371 0 0 67.0 66.740639 0 0 68.0 67.73279 0 0 69.0 68.724824 0 0 70.0 69.716742 0 0 71.0 70.708543 0 0 72.0 71.700228 0 0 73.0 72.691795 0 0 74.0 73.683246 0 0 75.0 74.674581 0 0 76.0 75.665799 0 0 77.0 76.6569 0 0 78.0 77.647885 0 0 79.0 78.638754 0 0 80.0 79.629505 0 0 81.0 80.620141 0 0 82.0 81.610659 0 0 83.0 82.601062 0 0 84.0 83.591347 0 0 85.0 84.581517 0 0 86.0 85.57157 0 0 87.0 86.561506 0 0 88.0 87.551326 0 0 89.0 88.54103 0 0 90.0 89.530617 0 0 91.0 90.520088 0 0 92.0 91.509442 0 0 93.0 92.49868 0 0 94.0 93.487802 0 0 95.0 94.476808 0 0 96.0 95.465697 0 0 97.0 96.45447 0 0 98.0 97.443126 0 0 99.0 98.431667 0 0 100.0 99.420091 0 0
You can make the top table (## METRICS CLASS section) much more readable by 'transposing it' under unix or with this nice webtool: https://www.browserling.com/tools/text-transpose
LIBRARY | NA12878_giab |
---|---|
UNPAIRED_READS_EXAMINED | 229105 |
READ_PAIRS_EXAMINED | 33980150 |
SECONDARY_OR_SUPPLEMENTARY_RDS | 262641 |
UNMAPPED_READS | 229109 |
UNPAIRED_READ_DUPLICATES | 141832 |
READ_PAIR_DUPLICATES | 151566 |
READ_PAIR_OPTICAL_DUPLICATES | 3728 |
PERCENT_DUPLICATION | 0.006525 |
ESTIMATED_LIBRARY_SIZE | 3892930596 |
The covariate analysis results can be reviewed in the HG001_1pc.covariates.pdf and HG001_10pc.covariates.pdf PDF files.
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.1 | NGS-Var2020 Exercise.3 ]