BioMart
The purpose of BioMart is to provide uniform access to a set of different biological databases.
You can use the web portal, called Bio Portal, to do your searches or you can download and install the software on your computer. We will use the web portal in these exercises, so go to the BioMart home page.
A simple BioMart query involves three steps:
- choosing a dataset to search in
- setting filters to restrict the search space
- specifying the type of data you want to retrieve
Exercise 1: human proteins with a retinol binding domain
Suppose we want to retrieve all human coding sequences of proteins with a retinol binding domain (IPR002449).
To start the search choose a database. For these exercises we will use ENSEMBL GENES 91.
Once you have selected a database, you can select a dataset from this database.
What is the dataset that you are going to search in? |
---|
We want to retrieve human sequences so we need Homo sapiens genes. |
Next click Filters in the left menu to set filters on the search space. You can select filters by choosing or entering a value/option or by clicking a checkbox.
Set the filter for this search |
---|
We want to select based on the occurrence of a protein domain.
|
Next click Attributes in the left menu to choose the information that you want to retrieve. The Attributes (output types) are arranged into multiple sections which can be expanded. To choose an attribute simply click the checkbox next to its description.
Set the attributes for this search: Ensembl Gene ID, Ensembl Transcript ID and CDS. |
---|
To annotate the sequences we need Ensembl IDs, so we check the Ensembl Gene ID and Ensembl Transcript ID checkboxes on the Features page of the Gene section (they are normally selected by default).
To retrieve the sequences we check the Coding sequence checkbox on the Sequences page of the Sequences section.
|
When you are happy with the query you can preview the results by clicking the Results button in the top panel.
Download the unique sequences in FASTA format. |
---|
To export the results to a FASTA file, select Unique results only and click Go.
|
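The three choices made in this exercise (dataset, filter, attributes) can also be written down as a BioMart XML query. The following is a minimal sketch in Python that only builds the query text and does not contact any server. `build_query` is a hypothetical helper, not part of BioMart, and the internal names `hsapiens_gene_ensembl`, `interpro`, `ensembl_gene_id`, `ensembl_transcript_id` and `coding` are assumptions about how Ensembl names these items; check them against the XML that the portal generates for your own query.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes, formatter="FASTA", unique=True):
    """Build a BioMart XML query string (hypothetical helper)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,
        "header": "0",
        # "1" corresponds to ticking "Unique results only" in the portal
        "uniqueRows": "1" if unique else "0",
        "datasetConfigVersion": "0.6",
    })
    # Step 1: the dataset to search in
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Step 2: filters that restrict the search space
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Step 3: attributes (output columns) to retrieve
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed internal names for the choices made in Exercise 1
xml_query = build_query(
    dataset="hsapiens_gene_ensembl",
    filters={"interpro": "IPR002449"},
    attributes=["ensembl_gene_id", "ensembl_transcript_id", "coding"],
)
print(xml_query)
```

Keeping the query as text makes it easy to store next to your results, so that the same search can be repeated later on a newer Ensembl release.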
Exercise 2: proteins from human chromosome Y
Retrieve the HGNC gene symbols of the proteins from chromosome Y for further functional annotation.
What is the dataset that you are going to search in? |
---|
The same as in the previous exercise: Homo sapiens genes. |
Filter on chromosome |
---|
This can be done in the REGION section of the filters
|
How many genes are located on chromosome Y? |
---|
Click the COUNT button to see the results. |
Filter on genes that have an HGNC symbol |
---|
This can be done in the GENE section of the filters
|
We want to use the results for an enrichment analysis, so we need HGNC symbols only (enrichment tools like DAVID accept only one column of IDs as input).
Set the attributes |
---|
|
How many genes fulfill these criteria? |
---|
Click the Count button in the top toolbar. |
Visualize all unique results. |
---|
Click the Results button in the top toolbar. Since 429 genes fulfill the criteria that you have set, you need to view all rows to get all symbols on one page. BioMart often generates duplicates, so select Unique results only. |
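If you have already exported a result list that still contains duplicates or empty rows, the clean-up can be done afterwards. The following is a minimal sketch in Python of what Unique results only achieves for a single column of symbols; `unique_symbols` is a hypothetical helper and the sample rows are only there for illustration.

```python
def unique_symbols(rows):
    """Return non-empty symbols in first-seen order, without duplicates."""
    seen = set()
    result = []
    for row in rows:
        symbol = row.strip()
        # Skip empty rows (genes without a symbol) and repeated symbols
        if symbol and symbol not in seen:
            seen.add(symbol)
            result.append(symbol)
    return result


# Illustrative input: one symbol per row, with duplicates and an empty row
rows = ["SRY", "ZFY", "SRY", "", "UTY", "ZFY"]
print("\n".join(unique_symbols(rows)))
```

The output has one symbol per line, which is the single-column format expected when pasting a gene list into an enrichment tool.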
Copy the results and go to DAVID to do an enrichment analysis.
Perform the enrichment analysis on the human proteins of chromosome Y |
---|
Paste the symbols in the upload area (red), specify that the IDs that you use are official gene symbols (green).
Specify that the genes you upload constitute the list (blue) of genes you want to search for enriched annotations (compared to the background).
Go to the List tab. Specify that these are human genes.
Now DAVID automatically chooses the full human genome as background. |
DAVID will now add annotations to these genes. It counts the number of times each annotation occurs in your list and compares this number to the frequency of that annotation in the complete genome. In this way it identifies annotations that occur more frequently in your list than expected from the genome as a whole (= enriched).
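To make the idea of enrichment concrete, the comparison between list and background can be expressed as a hypergeometric test: how likely is it to see at least this many annotated genes in a list of this size if the genes were drawn at random from the background? The following is a minimal sketch in Python. It is a plain hypergeometric test used as a stand-in, not DAVID's own implementation (DAVID reports its own statistic), and both the function name `enrichment_p_value` and the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(list_hits, list_size, background_hits, background_size):
    """P(X >= list_hits) when list_size genes are drawn at random
    from a background in which background_hits genes carry the annotation."""
    total = comb(background_size, list_size)
    upper = min(list_size, background_hits)
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): background of 10 genes, 3 of them annotated;
# our list has 4 genes, 2 of which carry the annotation.
p = enrichment_p_value(list_hits=2, list_size=4, background_hits=3, background_size=10)
print(p)
```

A small p-value means that the annotation occurs in the list more often than random sampling from the background would explain. Because many annotations are tested at once, real tools also correct these values for multiple testing.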
It fetches annotations from different sources and you can look at the enriched annotations from each source separately. However, you can also look at the results of all sources combined, which is more informative in my opinion.
Perform the functional annotation clustering on the results of the enrichment analysis |
---|
Click the Functional Annotation Clustering button at the bottom of the page to get a consolidated view of all sources. |
Relevant enriched annotations include:
- spermatogenesis and related annotations
- sexual differentiation and related annotations
However, more striking is the enrichment of proteins involved in regulation of transcription linked to the following enriched annotations:
- RNA binding proteins and related annotations
- chromatin organisation and related annotations
Exercise 3: ID conversion in BioMart
Remember the potential human TP53 targets from the exercises on finding TF binding motifs in DNA sequences. Originally, the ChIP Seq experiment generated a list of gene names of potential TP53 targets. To use them in one of the RSAT tools I had to convert the gene names to Ensembl Gene IDs.
Retrieve the Ensembl gene IDs of this list of genes |
---|
|
Similarly retrieve their probe set IDs on the Affy hg U95A arrays |
---|
|
Retrieve GO annotations based on their probe set IDs on the Affy hg U95A arrays |
---|
|
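Once a conversion table has been exported from BioMart as tab-separated text, it is usually convenient to load it as a lookup table. The following is a minimal sketch in Python, assuming a two-column export without a header line in which the first column holds the gene name and the second the Ensembl gene ID. `read_mapping` is a hypothetical helper, the column order depends on the order in which you selected the attributes, and the two sample rows are only there for illustration.

```python
import csv
import io
from collections import defaultdict


def read_mapping(tsv_text, key_column=0, value_column=1):
    """Map each key (e.g. gene name) to the list of IDs found for it."""
    mapping = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip rows that are too short to hold both columns
        if len(row) <= max(key_column, value_column):
            continue
        key, value = row[key_column].strip(), row[value_column].strip()
        # One input ID can map to several output IDs; keep each only once
        if key and value and value not in mapping[key]:
            mapping[key].append(value)
    return dict(mapping)


# Illustrative export: gene name <TAB> Ensembl gene ID, with one duplicate row
tsv_text = (
    "TP53\tENSG00000141510\n"
    "CDKN1A\tENSG00000124762\n"
    "TP53\tENSG00000141510\n"
)
print(read_mapping(tsv_text))
```

Storing a list per key matters for the later steps of this exercise, because one gene can correspond to several probe set IDs and one probe set to several GO annotations.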