Archive for Module 5
The Ensembl Genome browser
Exploring a region
TASKS
(a) Go to the region from bp 52,600,000 to 53,300,000 on human
chromosome 4. How many contigs make up this portion of the assembly
(contigs are contiguous stretches of DNA sequence that have been
assembled solely based on direct sequencing information)? What does the
open (i.e. non-blue) part in the ‘Contigs’ track represent?
(b) Do the tilepath clones (i.e. the BAC clones that were sequenced for the
human genome assembly) correspond with the contigs?
(c) Zoom in on the SGCB transcript, including a bit of flanking sequence on
both sides.
(d) CpG islands are genomic regions that contain a high frequency of CG
dinucleotides and are often located near the promoter of mammalian genes.
Is there a CpG island located at the 5’ end of the SGCB transcript? Possible
transcription start sites can be predicted using the Eponine program
(http://www.sanger.ac.uk/Software/analysis/eponine/). Is there a transcription
start site predicted by Eponine annotated for the SGCB transcript?
(e) Export the genomic sequence of the region you are looking at in FASTA
format.
(f) If you have a genomic region of interest of your own, explore what
information Ensembl displays about it!
ANSWERS
(a)
- Go to the Ensembl homepage (http://www.ensembl.org).
- Select ‘Search: Human’ and type ‘4:52600000..53300000’ in the ‘for’ text box (or alternatively leave the ‘Search’ drop-down list as it is and type ‘human 4:52600000..53300000’ in the ‘for’ text box).
- Click [Go].
This genomic region is made up of seven contigs, indicated by the alternating light and dark blue bars in the ‘Contigs’ track. One of them, AC107402.4, located between AC093858.2 and AC093880.4, is very small. The open part in the ‘Contigs’ track represents a gap in the genome assembly (although the human genome is called ‘finished’, there are still gaps!). Note that this region is very close to the centromere of the chromosome. If you cannot see the very small contig, you can increase the width of your display as follows:
- Click on ‘Configure this page’ in the side menu.
- Click on the ‘Configure page’ tab.
- Select a larger value from the ‘Width of image:’ drop-down list.
- Click [Save and close].
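The region string typed into the search box in (a) can also be handled programmatically. The following is a minimal sketch, not part of Ensembl itself: the helper name `parse_region` is hypothetical, and it assumes the ‘chromosome:start..end’ notation used above, with 1-based coordinates that include both end points.

```python
import re


def parse_region(region: str) -> tuple[str, int, int]:
    """Split a 'chrom:start..end' string into chromosome, start and end."""
    # Allow thousands separators such as 52,600,000 by removing commas first.
    match = re.fullmatch(r"(\w+):(\d+)\.\.(\d+)", region.replace(",", ""))
    if match is None:
        raise ValueError(f"not a region string: {region!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


chrom, start, end = parse_region("4:52600000..53300000")
# Both end points are part of the region, hence the + 1.
length = end - start + 1
print(chrom, start, end, length)
```

The region explored in this exercise therefore spans 700,001 bp.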
(b)
If the tilepath clones are not already shown:
- Click on ‘Configure this page’ in the side menu.
- Type ‘tilepath’ in the ‘Search display’ text box.
- Select ‘Tilepath - Normal’.
- Click [Save and close].
The tilepath clones do correspond to the contigs, and it is easy to see which BAC clone each contig sequence in the assembly is derived from, e.g. AC027271.7 is derived from RP11-365H22, AC104784.5 is derived from RP11-61F5 etc.
(c)
- Use your mouse to draw a box around the SGCB transcript.
- Click on ‘Jump to region’ in the pop-up menu.
(d)
- Click on ‘Configure this page’ in the side menu.
- Type ‘cpg’ in the ‘Search display’ text box.
- Select ‘CpG islands - Normal’.
- Click [Save and close].
There is indeed a CpG island located at the 5’ end of the SGCB transcript. Note that you can also clearly see this island in the ‘%GC’ track, which should be shown by default.
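To make the idea of ‘a high frequency of CG dinucleotides’ concrete, the sketch below computes two quantities that are commonly used to describe CpG islands: the GC fraction and the ratio of observed to expected CpG dinucleotides. The function names and the toy sequences are made up for illustration. This is not the method Ensembl uses to build its ‘CpG islands’ track, whose parameters are not given here.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    # 'CG' cannot overlap with itself, so str.count finds every occurrence.
    return seq.count("CG") * len(seq) / (c_count * g_count)


# Toy sequences (not real genomic data): one CpG-rich, one CpG-poor.
for toy in ("CGCGCGATCGCG", "ATATATGCATAT"):
    print(toy, round(gc_fraction(toy), 2), round(cpg_obs_exp(toy), 2))
```

A sequence exported from the 5’ end of SGCB in step (e) could be passed to the same functions and compared with a stretch from inside the gene.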
- Click on ‘Configure this page’ in the side menu.
- Type ‘eponine’ in the ‘Search display’ text box.
- Select ‘TSS (Eponine) - Normal’.
- Click [Save and close].
Eponine TSS-finder predicts a transcription start site almost exactly at the location where the Ensembl SGCB transcript starts.
(e)
- Click on ‘Export data’ in the side menu.
- Click [Next>].
- Click on ‘Text’.
Note that the sequence has a header that provides information about the genome assembly (GRCh37), the chromosome, the start and end coordinates and the strand. For example: >4 dna:chromosome chromosome:GRCh37:4:52886013-52905920:1
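The header fields can be pulled apart with a few lines of code. This is a minimal sketch that assumes the header layout is exactly as in the example above (sequence name, sequence type, then a colon-separated location with ‘start-end’). The function name `parse_export_header` is hypothetical, and headers from other Ensembl exports may be laid out differently.

```python
def parse_export_header(header: str) -> dict:
    """Parse a FASTA header shaped like the example shown above."""
    name, seq_type, location = header.lstrip(">").split()
    coord_system, assembly, region_name, span, strand = location.split(":")
    start, end = (int(value) for value in span.split("-"))
    return {
        "name": name,
        "type": seq_type,
        "coord_system": coord_system,
        "assembly": assembly,
        "region": region_name,
        "start": start,
        "end": end,
        "strand": int(strand),
        # Coordinates include both end points, hence the + 1.
        "length": end - start + 1,
    }


info = parse_export_header(
    ">4 dna:chromosome chromosome:GRCh37:4:52886013-52905920:1"
)
print(info["assembly"], info["region"], info["start"], info["end"], info["length"])
```

A quick sanity check on an exported file is that the number of bases following the header equals the computed length.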
Exploring a SNP
TASKS
Especially in the clinical literature SNPs are often referred to by a gene name and a nucleotide or amino acid change instead of a dbSNP reference SNP (rs) number. An example of this is the non-synonymous SNP in the PTPN22 (Tyrosine-protein phosphatase non-receptor type 22) gene that has been identified as a genetic risk factor for type 1 diabetes. This SNP is often
referred to as ‘PTPN22 620W’ or ‘PTPN22 +1858C>T’ (see for example Zoledziewska et al. Diabetes 2008 Jan;57(1):229-34).
(a) Find the Ensembl page with information for this SNP.
(b) Why are the alleles on this page given as A/G and not as C/T?
(c) What is the minor allele of this SNP in Caucasians?
(d) Why does Ensembl put the A allele first (A/G), and why is the C allele
put first in the literature (C>T)?
(e) According to the data imported from dbSNP the ancestral allele for this
SNP is G. Ancestral alleles in dbSNP are based on a comparison between
human and chimp. Does the sequence in gorilla, orangutan and macaque
confirm that the ancestral allele indeed is G?
(f) Has this SNP also been implicated by genome-wide association studies to
be associated with diseases / traits other than type 1 diabetes?
(g) Have a look at the spliced sequence of one of the PTPN22 transcripts that
contain this SNP. Make sure the complete spliced transcript sequence, the
coding sequence and the protein sequence are shown. Relative to which
feature of the transcript is the numbering 1858 in ‘+1858C>T’?
ANSWERS
(a)
- Go to the Ensembl homepage (http://www.ensembl.org).
- Select ‘Search: Human’ and type ‘PTPN22 gene’ in the ‘for’ text box.
- Click [Go].
- Click on ‘Gene’ on the page with search results.
- Click on ‘Homo sapiens’.
- Click on ‘Ensembl protein_coding Gene: ENSG00000134242 (HGNC
Symbol: PTPN22)’.
- Click on ‘Genetic Variation - Variation Table’ in the side menu.
For SNPs in the coding sequence, the following method also works fine:
- Go to the Ensembl homepage (http://www.ensembl.org).
- Select ‘Search: Human’ and type ‘PTPN22 gene’ in the ‘for’ text box.
- Click [Go].
- Click on ‘Gene’ on the page with search results.
- Click on ‘Homo sapiens’.
- Click on ‘Ensembl protein_coding Gene: ENSG00000134242 (HGNC
Symbol: PTPN22)’.
- Click on ‘ENST00000359785’.
- Click on ‘Protein Information - Variations’ in the side menu.
- Click on the Transcript IDs of the other six transcripts.
Two of the seven transcripts of PTPN22, ENST00000359785 and ENST00000354605, contain a non-synonymous SNP that results in a W/R change at position 620 in the encoded protein. So this SNP, rs2476601, must be the one we are looking for.
- Click on ‘rs2476601’.
Note that at the top of the page (after ‘Present in’) it is mentioned that this SNP is part of the ‘Clinical/LSDB (locus-specific databases) variations from dbSNP’ set. This is a reserved or "precious" set of clinically associated SNPs from dbSNP.
(b)
The alleles for this SNP are given as A/G because these are the alleles on the forward strand of the genome. In the literature the SNP is referred to as ‘+1858C>T’ because the PTPN22 gene is located on the reverse strand of the genome, so the alleles in the actual gene and transcript sequence are C/T.
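The strand conversion described here is simply base complementation. A minimal sketch follows, with a hypothetical helper name `to_other_strand`. Each allele is complemented, and alleles longer than one base are also reversed, which makes no difference for a SNP.

```python
# Translation table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_other_strand(alleles: str) -> str:
    """Convert an allele string such as 'A/G' to the opposite strand."""
    return "/".join(
        allele.upper().translate(COMPLEMENT)[::-1]
        for allele in alleles.split("/")
    )


print(to_other_strand("A/G"))  # T/C
```

The forward-strand alleles A and G thus correspond to T and C in the gene and transcript sequence, the same pair of bases that appears in ‘+1858C>T’.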
(c)
- Click on ‘Population genetics’ in the side menu.
In Caucasians (CSHL-HAPMAP:HapMap-CEU population, ‘Utah residents with Northern and Western European ancestry’) the minor allele is A. This population is shown twice with slightly different allele frequencies because SNP rs2476601 was submitted to dbSNP by multiple labs, which apparently obtained slightly different allele frequencies in their experiments.
These different submissions are indicated by different ssIDs (submitted SNP IDs), clustered into one rsID (reference SNP ID). See also http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=2476601#Diversity.
(d)
In Ensembl the allele that is present in the GRCh37 reference genome assembly is put first, i.e. A. In the literature the major allele is put first, i.e. C. Note that in this case the reference genome assembly (which is haploid and constructed from multiple individuals) contains the minor allele of this SNP.
(e)
- Click on ‘Phylogenetic context’ in the side menu.
- Select ‘Alignment: 5 catarrhini primates EPO’.
- Click [Go>].
Gorilla, orangutan and macaque also have a G in this position, which confirms that G is indeed the ancestral allele.
(f)
- Click on ‘Phenotype Data’ in the side menu.
Genome-wide association studies have also associated this SNP with Systemic Lupus Erythematosus (SLE), gender differentiated in women, Crohn’s Disease (CD) and Rheumatoid Arthritis (RA). See also SNPedia.
(g)
- Click on the ‘Gene: PTPN22’ tab.
- Click on ‘ENST00000359785’ or ‘ENST00000354605’.
- Click on ‘Sequence - cDNA’ in the side menu.
The numbering 1858 is relative to the start of the coding sequence (ATG). This is the second sequence shown in the figure.
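Counting from the A of the ATG as position 1, a coding-sequence position can be converted to a codon number with simple integer arithmetic. The sketch below (hypothetical function name `codon_of`) shows how nucleotide 1858 relates to amino acid 620 in the SNP's other name, ‘PTPN22 620W’.

```python
def codon_of(cds_position: int) -> tuple[int, int]:
    """Return (codon number, position within the codon), both 1-based."""
    if cds_position < 1:
        raise ValueError("CDS positions start at 1 (the A of the ATG)")
    zero_based = cds_position - 1
    return zero_based // 3 + 1, zero_based % 3 + 1


print(codon_of(1858))  # (620, 1)
```

Nucleotide 1858 of the coding sequence is the first base of codon 620, which is why the same SNP is described both as ‘+1858C>T’ and as a change at amino acid 620.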
Comparing a SNP in different individuals
TASKS
(a) Find the PTPN22 (Tyrosine-protein phosphatase non-receptor type 22) gene for human and go to the ‘Genetic Variation - Comparison image’ page for the transcript ENST00000359785.
(b) Do Venter and Watson have resequence coverage at the position of rs2476601 (also referred to as ‘PTPN22 620W’ or ‘+1858C>T’)?
(c) What are the genotypes of Venter and Watson at the position of rs2476601?
(d) Show the resequencing alignment for Venter and Watson for the 1000 bp region around this SNP. What does the R at the position of this SNP in the Watson sequence mean?
ANSWERS
(a)
- Go to the Ensembl homepage (http://www.ensembl.org).
- Select ‘Search: Human’ and type ‘PTPN22 gene’ in the ‘for’ text box.
- Click [Go].
- Click on ‘Gene’ on the page with search results.
- Click on ‘Homo sapiens’.
- Click on ‘Ensembl protein_coding Gene: ENSG00000134242 (HGNC
Symbol: PTPN22)’.
- Click on ‘ENST00000359785’.
- Click on ‘Genetic Variation - Comparison image’ in the side menu.
(b)
Yes, Venter and Watson both have resequence coverage >1 at the position of rs2476601, as indicated by the dark grey bar.
(c)
- Click on or mouse over the green/purple boxes for rs2476601 for
Venter and Watson.
Venter’s genotype is G|G (purple), while Watson’s is A|G (green/purple). Note that the (haploid) GRCh37 reference assembly has the minor allele of rs2476601, i.e. A (green).
(d)
- Click on the yellow box with ‘W/R’ in it or the open yellow box for rs2476601 at the bottom of the figure.
- Click on ‘Variation Properties’ in the pop-up menu.
- Click on ‘Jump to region in detail’.
- Click on ‘Resequencing’ in the side menu.
R is the IUPAC (http://www.bioinformatics.org/sms/iupac.html) or ambiguity code for an A or a G (both of which are puRines). Note that the resequencing alignment shows the sequence of the forward strand of the genome assembly. If you want the reverse strand instead, you can reconfigure the page using ‘Configure this page’.
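The two-base ambiguity codes can be written down as a small lookup table. The sketch below uses a hypothetical function name, `genotype_to_iupac`, and assumes genotypes are written as in answer (c), i.e. two single-base alleles separated by ‘|’. It returns the base itself for a homozygous genotype and the IUPAC code for a heterozygous one.

```python
# IUPAC ambiguity codes for every pair of different DNA bases.
IUPAC_PAIRS = {
    frozenset("AG"): "R",  # puRine
    frozenset("CT"): "Y",  # pYrimidine
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def genotype_to_iupac(genotype: str) -> str:
    """Map a genotype such as 'A|G' to a single IUPAC character."""
    alleles = genotype.upper().split("|")
    if len(alleles) != 2 or any(a not in ("A", "C", "G", "T") for a in alleles):
        raise ValueError(f"expected two single-base alleles: {genotype!r}")
    if alleles[0] == alleles[1]:
        return alleles[0]
    return IUPAC_PAIRS[frozenset(alleles)]


print(genotype_to_iupac("G|G"), genotype_to_iupac("A|G"))  # G R
```

With the genotypes from answer (c), Venter's G|G stays G and Watson's A|G becomes R, matching the resequencing alignment.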
Introduction to Galaxy
Galaxy addresses the need to have your most-used tools accessible in one place. It also addresses the need to have your data available in one place. Best of all, Galaxy lets you store the analyses you have done and share them with others.
Galaxy (http://main.g2.bx.psu.edu) is a very active open-source project that gives you a framework in which to rapidly integrate new tools and add new data. The best way to get started with Galaxy is to watch some of the screencasts on the Galaxy wiki.
You can install Galaxy on your own computer, or you can use the main portal (http://main.g2.bx.psu.edu). If you need more tools in your Galaxy, you can find them in the Galaxy Toolshed at http://toolshed.g2.bx.psu.edu/ and install them with one click.
Tutorial on Galaxy
When you have done an analysis, you can document all the steps in Galaxy pages. On the main Galaxy (http://main.g2.bx.psu.edu) you can find such pages under 'Shared Data' --> Published pages. Some of them are in fact tutorials on Galaxy.
If you are interested, you can follow this good starting tutorial to get hands-on experience with Galaxy: http://main.g2.bx.psu.edu/u/aun1/p/galaxy101