NGS-Var2018 Exercise.2
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2018 | NGS-Var2018 Exercise.1 | NGS-Var2018 Exercise.3 ]
Align paired end reads to the human reference genome GRCh38 using the Burrow Wheeler Aligner (BWA)
Contents
Introduction
Reference mapping is the process applied to NGS reads when the reference genome is available. Mapping (aligning) reads to the reference is required in order to later pileup all alignment results and search for variants at each conflicting position. In the mapping step, each read is aligned to the reference genome and the genome coordinate of the best hit(s) is(are) stored together with the read sequence and quality parameters in a SAM/BAM file. This is the most time consuming step of NGS analysis and its quality and completeness will condition all downstream processes.
Full mapping of an average human NGS Illumina dataset (100M read pairs) will take several days and use full computer power on a 48cpu computer with 48GB RAM (values are indicative).
prepare the reference genome for BWA alignment
BWA aligns reads to a library of possible short nucleotides (hash table). A hash table is build once for each new reference genome using one of BWA commands.
This step was performed for you and the reference index saved to the GenePattern server under the name GRCh38.
Align the reads in pairs to the reference genome using the bwa mem algorithm
We will do this step using the smallest 1% sample and not the full data in order to speed up the process
- start the BWA mem module
- in the 'input' parameter group, link the reference and the two 1% read files (0.01) in the corresponding fields
- review the optional settings but do not change the defaults
use the text below to fill the text fields:
@RG ID:HG001 LB:NA12878_giab PU:unknown-0.0 PL:Illumina SM:NA12878
HG001_1pc
- run and wait for results, you should get a job as shown next
Prepare the mappings for variant calling with various Picard tools grouped in GATK.BamPreprocess
The presence of PCR duplicates in NGS libraries can cause false positive variant calls at later stages. For this reason, the duplicates present after mapping to the reference genome must be marked using the dedicated Picard tool.
You can read more about read duplicates in our [Q&A section about read duplicates]
Read Name regex: Reads are expected to be named in the following convention machine:run:flowcell:lane:tile:x_coord:y_coord
- start the GATK.BamPreprocess module
- point the tool to the SAM file obtained after BWA mapping then review and adapt the other fields as shown
- run and wait for the results
- open the 'HG001_1pc.duplicatesmetrics.txt' summary text file to get numbers
The metrics are explained HERE
Results for the 1% run
## htsjdk.samtools.metrics.StringHeader # MarkDuplicates --INPUT HG001_1pc.querysorted.bam --OUTPUT HG001_1pc.marked.bam --METRICS_FILE HG001_1pc.duplicatesmetrics.txt --ASSUME_SORT_ORDER queryname --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --TMP_DIR TMP --QUIET true --VALIDATION_STRINGENCY SILENT --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false ## htsjdk.samtools.metrics.StringHeader # Started on: Wed Oct 10 15:39:55 CEST 2018 ## METRICS CLASS picard.sam.DuplicationMetrics LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE NA12878_giab 2284 340118 2599 2284 34 20 0 0.000108 2891892954 ## HISTOGRAM java.lang.Double BIN VALUE 1.0 1 2.0 1.999882 3.0 2.999647 4.0 3.999294 5.0 4.998824 6.0 5.998236 7.0 6.997531 8.0 7.996708 9.0 8.995767 10.0 9.994709 11.0 10.993534 12.0 11.992241 13.0 12.990831 14.0 13.989303 15.0 14.987658 16.0 15.985895 17.0 16.984015 18.0 17.982018 19.0 18.979903 20.0 19.977671 21.0 20.975322 22.0 21.972855 23.0 22.970271 24.0 23.967569 25.0 24.964751 26.0 25.961815 27.0 26.958761 28.0 27.955591 29.0 28.952303 30.0 29.948898 31.0 30.945376 32.0 31.941737 33.0 32.937981 34.0 33.934107 35.0 34.930116 36.0 35.926008 37.0 36.921783 38.0 37.917441 39.0 38.912982 40.0 39.908405 41.0 40.903712 42.0 41.898902 43.0 42.893974 44.0 43.88893 45.0 44.883768 46.0 45.87849 47.0 46.873094 48.0 47.867582 49.0 48.861952 50.0 49.856206 51.0 50.850343 52.0 51.844362 53.0 52.838265 54.0 53.832051 55.0 54.82572 56.0 55.819273 57.0 56.812708 58.0 57.806027 59.0 58.799229 60.0 59.792314 61.0 60.785282 62.0 61.778133 63.0 62.770868 64.0 63.763486 65.0 64.755987 66.0 65.748371 67.0 66.740639 68.0 67.73279 69.0 68.724824 70.0 69.716742 71.0 70.708543 72.0 71.700228 73.0 72.691795 74.0 73.683246 75.0 74.674581 76.0 75.665799 77.0 76.6569 78.0 77.647885 79.0 78.638754 80.0 79.629505 81.0 80.620141 82.0 81.610659 83.0 82.601062 84.0 83.591347 85.0 84.581517 86.0 85.57157 87.0 86.561506 88.0 87.551326 89.0 88.54103 90.0 89.530617 91.0 90.520088 92.0 91.509442 93.0 92.49868 94.0 93.487802 95.0 94.476808 96.0 95.465697 97.0 96.45447 98.0 97.443126 99.0 98.431667 100.0 99.420091
You can make the top table (## METRICS CLASS section) much more readable by 'transposing it' under unix or with this nice webtool: https://www.browserling.com/tools/text-transpose
LIBRARY | NA12878_giab |
---|---|
UNPAIRED_READS_EXAMINED | 2284 |
READ_PAIRS_EXAMINED | 340118 |
SECONDARY_OR_SUPPLEMENTARY_RDS | 2599 |
UNMAPPED_READS | 2284 |
UNPAIRED_READ_DUPLICATES | 34 |
READ_PAIR_DUPLICATES | 20 |
READ_PAIR_OPTICAL_DUPLICATES | 0 |
PERCENT_DUPLICATION | 0.000108 |
ESTIMATED_LIBRARY_SIZE | 2891892954 |
The covariate analysis results can be reviewed in the HG001_1pc.covariates.pdf and HG001_10pc.covariates.pdf PDF files.
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2018 | NGS-Var2018 Exercise.1 | NGS-Var2018 Exercise.3 ]