[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.1 | NGS-Var2020 Exercise.3 ]

Align paired end reads to the human reference genome GRCh38 using the Burrow Wheeler Aligner (BWA)

Introduction

Reference mapping is the process applied to NGS reads when the reference genome is available. Mapping (aligning) reads to the reference is required in order to later pileup all alignment results and search for variants at each conflicting position. In the mapping step, each read is aligned to the reference genome and the genome coordinate of the best hit(s) is(are) stored together with the read sequence and quality parameters in a SAM/BAM file. This is the most time consuming step of NGS analysis and its quality and completeness will condition all downstream processes.

Full mapping of an average human NGS Illumina dataset (100M read pairs) will take several days and use full computer power on a 48cpu computer with 48GB RAM (values are indicative).

prepare the reference genome for BWA alignment

BWA aligns reads to a library of possible short nucleotides (hash table). A hash table is build once for each new reference genome using one of BWA commands.

This step was performed for you and the reference index saved to the GenePattern server under the name GRCh38.

Nine takeways about GRCh38 from a Broad page

Align the reads in pairs to the reference genome using the bwa mem algorithm

We will do this step using the smallest 1% sample and not the full data in order to speed up the process

start the BWA mem module
in the 'input' parameter group, link the reference and the two 1% read files (0.01) in the corresponding fields

review the optional settings but do not change the defaults

use the text below to fill the text fields:

https://data.bits.vib.be/pub/trainingen/NGSVAR2020/reads/HG001.GRCh38_chr22_0.01_1.fq.gz

https://data.bits.vib.be/pub/trainingen/NGSVAR2020/reads/HG001.GRCh38_chr22_0.01_2.fq.gz

@RG ID:HG001 LB:NA12878_giab PU:unknown-0.0 PL:Illumina SM:NA12878

HG001_1pc

run and wait for results, you should get a job as shown next

Prepare the mappings for variant calling with various Picard tools grouped in GATK.BamPreprocess

The presence of PCR duplicates in NGS libraries can cause false positive variant calls at later stages. For this reason, the duplicates present after mapping to the reference genome must be marked using the dedicated Picard tool.

You can read more about read duplicates in our [Q&A section about read duplicates]

Read Name regex: Reads are expected to be named in the following convention machine:run:flowcell:lane:tile:x_coord:y_coord

start the GATK.BamPreprocess module
point the tool to the SAM file obtained after BWA mapping then review and adapt the other fields as shown

run and wait for the results

open the 'HG001_1pc.duplicatesmetrics.txt' summary text file to get numbers

The metrics are explained HERE

Results for the 1% run

## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates  --INPUT NA12878_0.01_1.fq.querysorted.bam --OUTPUT NA12878_0.01_1.fq.marked.bam --METRICS_FILE NA12878_0.01_1.fq.duplicatesmetrics.txt --ASSUME_SORT_ORDER queryname --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --TMP_DIR TMP --QUIET true --VALIDATION_STRINGENCY SILENT  --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
## htsjdk.samtools.metrics.StringHeader
# Started on: Mon Dec 30 09:24:45 CET 2019
 
## METRICS CLASS	picard.sam.DuplicationMetrics
LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
NA12878_01pc	2284	340118	2599	2284	34	20	0	0.000108	2891892954
 
## HISTOGRAM	java.lang.Double
BIN	CoverageMult	all_sets	non_optical_sets
1.0	1	340078	340078
2.0	1.999882	20	20
3.0	2.999647	0	0
4.0	3.999294	0	0
5.0	4.998824	0	0
6.0	5.998236	0	0
7.0	6.997531	0	0
8.0	7.996708	0	0
9.0	8.995767	0	0
10.0	9.994709	0	0
11.0	10.993534	0	0
12.0	11.992241	0	0
13.0	12.990831	0	0
14.0	13.989303	0	0
15.0	14.987658	0	0
16.0	15.985895	0	0
17.0	16.984015	0	0
18.0	17.982018	0	0
19.0	18.979903	0	0
20.0	19.977671	0	0
21.0	20.975322	0	0
22.0	21.972855	0	0
23.0	22.970271	0	0
24.0	23.967569	0	0
25.0	24.964751	0	0
26.0	25.961815	0	0
27.0	26.958761	0	0
28.0	27.955591	0	0
29.0	28.952303	0	0
30.0	29.948898	0	0
31.0	30.945376	0	0
32.0	31.941737	0	0
33.0	32.937981	0	0
34.0	33.934107	0	0
35.0	34.930116	0	0
36.0	35.926008	0	0
37.0	36.921783	0	0
38.0	37.917441	0	0
39.0	38.912982	0	0
40.0	39.908405	0	0
41.0	40.903712	0	0
42.0	41.898902	0	0
43.0	42.893974	0	0
44.0	43.88893	0	0
45.0	44.883768	0	0
46.0	45.87849	0	0
47.0	46.873094	0	0
48.0	47.867582	0	0
49.0	48.861952	0	0
50.0	49.856206	0	0
51.0	50.850343	0	0
52.0	51.844362	0	0
53.0	52.838265	0	0
54.0	53.832051	0	0
55.0	54.82572	0	0
56.0	55.819273	0	0
57.0	56.812708	0	0
58.0	57.806027	0	0
59.0	58.799229	0	0
60.0	59.792314	0	0
61.0	60.785282	0	0
62.0	61.778133	0	0
63.0	62.770868	0	0
64.0	63.763486	0	0
65.0	64.755987	0	0
66.0	65.748371	0	0
67.0	66.740639	0	0
68.0	67.73279	0	0
69.0	68.724824	0	0
70.0	69.716742	0	0
71.0	70.708543	0	0
72.0	71.700228	0	0
73.0	72.691795	0	0
74.0	73.683246	0	0
75.0	74.674581	0	0
76.0	75.665799	0	0
77.0	76.6569	0	0
78.0	77.647885	0	0
79.0	78.638754	0	0
80.0	79.629505	0	0
81.0	80.620141	0	0
82.0	81.610659	0	0
83.0	82.601062	0	0
84.0	83.591347	0	0
85.0	84.581517	0	0
86.0	85.57157	0	0
87.0	86.561506	0	0
88.0	87.551326	0	0
89.0	88.54103	0	0
90.0	89.530617	0	0
91.0	90.520088	0	0
92.0	91.509442	0	0
93.0	92.49868	0	0
94.0	93.487802	0	0
95.0	94.476808	0	0
96.0	95.465697	0	0
97.0	96.45447	0	0
98.0	97.443126	0	0
99.0	98.431667	0	0
100.0	99.420091	0	0

You can make the top table (## METRICS CLASS section) much more readable by 'transposing it' under unix or with this nice webtool: https://www.browserling.com/tools/text-transpose

LIBRARY	NA12878_giab
UNPAIRED_READS_EXAMINED	229105
READ_PAIRS_EXAMINED	33980150
SECONDARY_OR_SUPPLEMENTARY_RDS	262641
UNMAPPED_READS	229109
UNPAIRED_READ_DUPLICATES	141832
READ_PAIR_DUPLICATES	151566
READ_PAIR_OPTICAL_DUPLICATES	3728
PERCENT_DUPLICATION	0.006525
ESTIMATED_LIBRARY_SIZE	3892930596

The covariate analysis results can be reviewed in the HG001_1pc.covariates.pdf and HG001_10pc.covariates.pdf PDF files.

download exercise files

Download exercise files here

Use the right application to open the files present in ex2-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.1 | NGS-Var2020 Exercise.3 ]

NGS-Var2020 Exercise.2

Contents

Introduction

prepare the reference genome for BWA alignment

Align the reads in pairs to the reference genome using the bwa mem algorithm

Prepare the mappings for variant calling with various Picard tools grouped in GATK.BamPreprocess

download exercise files

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox