BioMart

The purpose of BioMart is to provide uniform access to a set of different biological databases.

You can run your searches through the web portal, called Bio Portal, or you can download and install the software on your own computer. We will use the web portal in these exercises, so go to the BioMart home page.

A simple BioMart query involves three steps:

choosing a dataset to search in
setting filters to restrict the search space
specifying the type of data you want to retrieve
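
The three steps above correspond to the parts of the XML query that BioMart servers accept behind the web interface. The following is a minimal sketch in Python that only builds such a query string. The dataset name `hsapiens_gene_ensembl` and the internal names `chromosome_name` and `ensembl_gene_id` are assumptions about the Ensembl mart, and `build_query` is a hypothetical helper, so verify the names against the XML that the portal generates for your own query.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes, unique_rows=True):
    """Return a BioMart-style XML query string for a single dataset.

    dataset    -- internal dataset name (step 1: where to search)
    filters    -- dict of filter name -> value (step 2: restrict the search)
    attributes -- list of attribute names (step 3: what to retrieve)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1" if unique_rows else "0",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed internal names; they are not taken from the exercises themselves.
xml_query = build_query(
    dataset="hsapiens_gene_ensembl",
    filters={"chromosome_name": "Y"},
    attributes=["ensembl_gene_id"],
)
print(xml_query)
```

The resulting string could then be submitted as the `query` parameter of a BioMart web service request. The sketch deliberately stops before any network call, so it runs offline.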

Exercise 1: human proteins with a retinol binding domain

Suppose we want to retrieve all human coding sequences of proteins with a retinol binding domain (IPR002449).
To start the search choose a database. For these exercises we will use ENSEMBL GENES 91.
Once you have selected a database, you can select a dataset from this database.

What is the dataset that you are going to search in?
We want to retrieve human sequences so we need Homo sapiens genes.

Next click Filters in the left menu to set filters on the search space. You can select filters by choosing or entering a value/option or by clicking a checkbox.

Set the filter for this search
We want to select genes based on the occurrence of a protein domain.
Expand the PROTEIN DOMAINS AND FAMILIES section.
Set a filter on Limit to genes with these family or domain IDs.
Select InterPro IDs.
Paste the InterPro ID (IPR002449) in the text box.

Next click Attributes in the left menu to choose the information that you want to retrieve. The Attributes (output types) are arranged into multiple sections which can be expanded. To choose an attribute simply click the checkbox next to its description.

Set the attributes for this search: Ensembl Gene ID, Ensembl Transcript ID and CDS.
To annotate the sequences we need Ensembl IDs, so we check the Ensembl Gene ID and Ensembl Transcript ID checkboxes on the Features page of the Gene section (they are normally selected by default). To retrieve the sequences we check the Coding sequence checkbox on the Sequences page of the Sequences section.

When you are happy with the query you can preview the results by clicking the Results button in the top panel.

Download the unique sequences in FASTA format.
To export the results to a FASTA file, select Unique results only and click Go.
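
Unique results only removes identical rows before the download. If you already have a downloaded file, a similar clean-up can be done locally. This is a minimal sketch that assumes a plain FASTA file; `unique_fasta` is a hypothetical helper, and the headers and sequences in the sample are made up for illustration.

```python
def unique_fasta(text):
    """Return (header, sequence) pairs, dropping exact duplicate records."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    seen = set()
    unique = []
    for record in records:
        # A record is a duplicate only if header AND sequence are identical.
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# Made-up records: the first two are the same record wrapped differently.
sample = """>geneA|transcriptA1
ATGGCC
TTTTAA
>geneA|transcriptA1
ATGGCCTTTTAA
>geneB|transcriptB1
ATGAAATAG
"""

for header, sequence in unique_fasta(sample):
    print(">" + header)
    print(sequence)
```

Two transcripts with different IDs but the same coding sequence are both kept, because their headers differ.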

Exercise 2: proteins from human chromosome Y

Retrieve the HGNC gene symbols of the proteins encoded on chromosome Y for further functional annotation.

What is the dataset that you are going to search in?
The same as in the previous exercise: Homo sapiens genes.

Filter on chromosome
This can be done in the REGION section of the filters.

How many genes are located on chromosome Y?
Click the Count button in the top toolbar to see the result.

Filter on the fact that they should have an HGNC symbol
This can be done in the GENE section of the filters.

We want to use the results for an enrichment analysis, so we need HGNC symbols only (enrichment tools like DAVID accept only one column of IDs as input).

Set the attributes

How many genes fulfill these criteria?
Click the Count button in the top toolbar.

Visualize all unique results.
Click the Results button in the top toolbar. Since 429 genes fulfill the criteria that you have set, you need to display all rows to get all symbols on one page. BioMart often generates duplicates, so check Unique results only.
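
If you export the results as a tab-separated file instead of copying them from the page, the single column of unique symbols that DAVID expects can also be prepared locally. This is a minimal sketch; `unique_symbols` is a hypothetical helper, and the sample rows are typed in by hand for illustration, not actual BioMart output.

```python
def unique_symbols(tsv_text, column=0):
    """Return the distinct, non-empty values of one TSV column, in input order."""
    seen = set()
    symbols = []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if column >= len(fields):
            continue  # row is too short to contain the requested column
        symbol = fields[column].strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            symbols.append(symbol)
    return symbols


# Hand-typed example rows, including a duplicate and an empty symbol.
export = "SRY\nZFY\nSRY\n\nDAZ1\n"

# One symbol per line, ready to paste into a one-column upload box.
print("\n".join(unique_symbols(export)))
```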

Copy the results and go to DAVID to do an enrichment analysis.

Perform the enrichment analysis on the human proteins of chromosome Y
Paste the symbols in the upload area (red) and specify that the IDs that you use are official gene symbols (green). Specify that the genes you upload constitute the list (blue) of genes you want to search for enriched annotations (compared to the background).
Go to the List tab. Specify that these are human genes. DAVID now automatically chooses the full human genome as background.

DAVID will now add annotations to these genes. It counts the number of times each annotation occurs in your list and compares this number to the frequency of that annotation over the complete genome. In this way it identifies annotations that occur more frequently in your list than expected from the genome as a whole (= enriched).
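
To make the idea of enrichment concrete, the comparison between list and genome can be sketched with a plain hypergeometric test. This illustrates the principle only: DAVID reports its own statistic (a modified Fisher exact test), so the numbers will not match DAVID's output. The function name and the counts in the example are made up.

```python
from math import comb


def enrichment_p_value(list_hits, list_size, genome_hits, genome_size):
    """Probability of seeing at least list_hits annotated genes by chance.

    Models the gene list as list_size genes drawn without replacement from a
    genome of genome_size genes, genome_hits of which carry the annotation.
    """
    max_hits = min(list_size, genome_hits)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(genome_hits, k) * comb(genome_size - genome_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


# Made-up counts: 20 of 400 list genes carry an annotation that
# 200 of 20000 genes in the genome carry.
list_hits, list_size = 20, 400
genome_hits, genome_size = 200, 20000

fold = (list_hits / list_size) / (genome_hits / genome_size)
p_value = enrichment_p_value(list_hits, list_size, genome_hits, genome_size)
print("fold enrichment:", fold)
print("p-value:", p_value)
```

An annotation that is no more frequent in the list than in the genome gives a fold enrichment near 1 and a large p-value.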

It fetches annotations from different sources and you can look at the enriched annotations from each source separately. However, you can also look at the results of all sources combined, which is more informative in my opinion.

Perform the functional annotation clustering on the results of the enrichment analysis
Click the Functional Annotation Clustering button at the bottom of the page to get a consolidated view of all sources.

Relevant enriched annotations include:

spermatogenesis and related annotations
sexual differentiation and related annotations

These results were more or less expected for sex chromosome encoded proteins.

However, more striking is the enrichment of proteins involved in regulation of transcription linked to the following enriched annotations:

RNA binding proteins and related annotations
chromatin organisation and related annotations

The remaining 2 clusters have very high adjusted p-values so I would not consider these.

Exercise 3: ID conversion in BioMart

Remember the potential human TP53 targets from the exercises on finding TF binding motifs in DNA sequences. Originally, the ChIP Seq experiment generated a list of gene names of potential TP53 targets. To use them in one of the RSAT tools I had to convert the gene names to Ensembl Gene IDs.

Retrieve the Ensembl gene IDs of this list of genes
Use the Human genes data set.
Set the filter: in the GENE section select Input external reference IDs, select HGNC IDs and upload the file with gene names.
Set the attributes: deselect Ensembl Transcript IDs.
Download the results.

Similarly retrieve their probe set IDs on the Affy hg U95A arrays
Use the Human genes data set.
Set the filter: in the GENE section select Input external reference IDs, select HGNC IDs and upload the file with gene names.
Set the attributes: deselect Ensembl Transcript IDs and Ensembl Gene IDs. In the External References section scroll down and select the microarray.
Download the results.
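
A downloaded conversion table is easier to use once it is loaded into a lookup structure. One gene can map to several IDs (for example several probe sets), so each key should hold a list. This is a minimal sketch that assumes a two-column tab-separated export; `build_id_map` is a hypothetical helper and the IDs in the sample are placeholders, not real identifiers.

```python
from collections import defaultdict


def build_id_map(tsv_text, key_col=0, value_col=1):
    """Map each ID in key_col to the list of distinct IDs in value_col."""
    mapping = defaultdict(list)
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) <= max(key_col, value_col):
            continue  # skip rows that lack one of the two columns
        key = fields[key_col].strip()
        value = fields[value_col].strip()
        # Rows with an empty cell mean "no mapping found" and are skipped.
        if key and value and value not in mapping[key]:
            mapping[key].append(value)
    return dict(mapping)


# Placeholder IDs: GENE_A maps to two probe sets, GENE_C to none.
table = "GENE_A\tPROBE_1\nGENE_A\tPROBE_2\nGENE_B\tPROBE_3\nGENE_A\tPROBE_1\nGENE_C\t\n"

id_map = build_id_map(table)
for gene, probes in id_map.items():
    print(gene, ",".join(probes))
```

Genes missing from the resulting dictionary are the ones for which no ID was returned, which is worth checking before you continue with the converted list.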

Retrieve GO annotations based on their probe set IDs on the Affy hg U95A arrays
Select the data set you want to work on: Homo sapiens genes (GRCh38.p2).
Then filter the data set down to the subset of human genes that you're interested in (the ones specified by the Affy IDs): click Filters in the left menu and specify the requirements that the genes you want to retain have to fulfill. Expand GENE and fill in the probe set IDs as the filter.
Finally, define the attributes, i.e. the information you want to obtain for this subset of human genes: click Attributes in the left menu and select the GO annotations you want.
GO Slims are cut-down versions of the GO annotation containing a subset of the terms in GO. For instance, they do not contain terms that are linked to a single protein. Since such terms are not relevant for enrichment analysis, I tend to use GO Slims if I can.
To see the results, click the Results button in the top toolbar. You will see a lot of duplicate entries: this is always the case with GO annotations. To remove duplicates, select Unique results only. You can export the results if you want to.
