Archive for Module1

From BITS wiki
Go to parent Basic bioinformatics concepts, databases and tools#Exercises_during_the_training

The ENA database

The European Bioinformatics Institute (EBI) hosts the ENA (European Nucleotide Archive) database: one part of ENA is called EMBL-bank, containing annotated primary sequence data. The other two parts are the Trace Archive and the Short Read Archive (SRA), containing batch-submitted primary sequence data.

EBI has multiple search portals:

  • ENA Browser to search in ENA
  • The fast search on the EBI home page and EBI Search perform a meta-search across all EBI databases (similar to Entrez)
  • SRS to perform searches on selected databases
     Note: information is liquid. Records change all the time: info is removed and added. Therefore, screenshots may not be up-to-date.

    The ENA Browser

    Go to the ENA Browser. You see two text fields: the upper one for "Text search", the lower one for "Sequence search". We will concentrate on the text search; sequence searches are covered in Module 2. You can search using free text (e.g. species names, disease names, feature names, ...) or using an accession number.

    Exercise 1: caspase

    Perform a search for 'caspase complete cds'.

    ENA.png

    The ENA search returns records from the EMBL-bank part of ENA, divided into "Update" and "Release". "Update" contains records that were recently updated. Clicking the "+" sign expands the corresponding section, revealing the individual search results.

    ENA2.png

    Each record can be further expanded by clicking on the "+" sign to see more details of the record. Do this for the first record of the "Update" list (JX912275 : Spodoptera frugiperda initiator caspase mRNA, complete cds).

    The most useful entries with the most relevant annotations are from the 'STD' (standard) data class. See more info on ENA database structure.

    Exercise 2: kinase

    The nicest thing about the ENA Browser search is that the results are categorized by the part of ENA from which they originate. This becomes clear when you do a text search with "kinase".

    ENA5.png

    The results page groups the entries according to type of sequence.
    The text searches that you can perform using the ENA Browser are very crude. For example, when you search for "kinase", every record that contains the word "kinase" anywhere is shown, including sequences that are not kinases, just as in GenBank. Be aware of this, because it is often not what you want!
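
    To see why a free-text search is crude, here is a minimal sketch in Python. The records, their field names and the accession numbers are invented for illustration, and this is not how the ENA Browser is implemented. It only contrasts matching a word anywhere in a record with matching it in the description alone.

```python
# Toy records; accessions, field names and texts are made up for illustration.
RECORDS = [
    {"acc": "X00001",
     "description": "Homo sapiens protein kinase mRNA, complete cds",
     "comment": ""},
    {"acc": "X00002",
     "description": "Mus musculus actin mRNA",
     "comment": "isolated in a screen for kinase substrates"},
]


def free_text_search(records, word):
    """Return accessions of records that mention the word in any field."""
    word = word.lower()
    return [r["acc"] for r in records
            if any(word in str(value).lower() for value in r.values())]


def field_search(records, field, word):
    """Return accessions of records that mention the word in one field only."""
    word = word.lower()
    return [r["acc"] for r in records if word in r[field].lower()]


print(free_text_search(RECORDS, "kinase"))
print(field_search(RECORDS, "description", "kinase"))
```

    The free-text search also returns the actin record, because the word "kinase" happens to occur in its comment.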

    EBI Search, cross-database search at EBI

    EB-eye is a cross-database search tool for EBI databases similar to Entrez for NCBI databases. You can access it on EBI Search.

    Exercise 1: AF242735

    Search for "AF242735".

    ENA6.png

  • Click "Summary information is available for this gene"
  • Click "Dream"
    This redirects you to the EBI summary record of this gene

    ENA7.png

    EBI provides very nice overview pages, with links to many other databases. A good place to start.

    MRS

    NCBI, ENA and UniProt all have their own search engines. Would it not be convenient to have one search portal that searches all databases?

    MRS (Maarten's Retrieval System) is such a web service! Using MRS, you can directly access many different sequence databases at once. As a bonus, MRS handles synonyms and misspelled words very well by suggesting similar words for your search term.

    MRS: performing a simple search

    Go to http://mrs.cmbi.ru.nl. The search box at the top gives you access to different databanks with different content and from different origins (more info on which databases are available in MRS). MRS is ideally suited to advanced searches, but let's start simple. :-)


    Let's see if we can find some information about the Muckle-Wells syndrome, a human genetic disease. In the "for" box, type "(muckle-wells) OR (muckle wells)" and click on the Search button on the right.

    Mrs start.png


    You will get a list of databanks showing, for each databank, how many "hits" were found, as well as some (!) of the hits (by default 2). You should see a series of sequence databanks as well as some non-sequence databanks like OMIM or Entrez Gene.

    Mrs startresult.png

    Let's first look at the hits found in the sequence database EMBL: click on EMBL. You will get the EMBL entries matching your query. The ID column is the accession number of that sequence, the title gives a description.

    Mrs emblresult.png

    The pink bar measures the "relevance" of the hit (based on how often your query words "muckle" and "wells" occur in the text of the entry versus in the complete EMBL databank). Click on the accession number "af410477". General browsing tip: press ctrl+F to open a search box in your browser, and enter the search term af410477.
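
    The idea behind such a relevance measure can be sketched in Python. The page does not give the exact formula that MRS uses, so the TF-IDF-style score below is an assumption chosen for illustration, and the three short "entries" are invented. A word counts more when it is frequent in the entry and rare in the databank as a whole.

```python
import math


def relevance(query_words, entry_text, all_entry_texts):
    """Score an entry: words frequent in the entry but rare in the databank count most."""
    entry_words = entry_text.lower().split()
    n_entries = len(all_entry_texts)
    score = 0.0
    for word in query_words:
        # Term frequency: share of the entry's words that equal the query word.
        tf = entry_words.count(word) / len(entry_words) if entry_words else 0.0
        # Number of entries in the whole databank that contain the word.
        n_containing = sum(1 for t in all_entry_texts if word in t.lower().split())
        # Smoothed inverse document frequency: rarer words get a higher weight.
        idf = math.log((1 + n_entries) / (1 + n_containing)) + 1.0
        score += tf * idf
    return score


# Invented mini-databank of three entries.
databank = [
    "muckle wells syndrome cryopyrin gene",
    "wells of water in human syndrome studies",
    "unrelated kinase sequence",
]
scores = [relevance(["muckle", "wells"], text, databank) for text in databank]
for text, score in zip(databank, scores):
    print(f"{score:.3f}  {text}")
```

    The first entry contains both query words and scores highest; the last entry contains neither and scores zero.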

    Now we see all information from entry AF410477. MRS has displayed it in a "nice" way, but you can also have a look at the "raw" text: set the selector "View" to "Plain text" (in this case, EMBL format). Recapitulation: each line starts with an index (a two-letter line code); the FT part is the largest and contains annotation information and cross-references to other databases. Because of these indexes, the information is structured and hence can easily be searched by humans and computers.

    Mrs raw.png
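
    To make the role of these line indexes concrete, here is a small Python sketch that groups the lines of a raw EMBL-style record by their two-letter code. The record is hand-written and heavily abbreviated (it is not the real AF410477 entry), and the function name is our own.

```python
from collections import defaultdict

# Abbreviated, hand-written record in EMBL flat-file style (placeholder content).
RAW = """\
ID   EXAMPLE1; SV 1; linear; mRNA; STD; HUM; 30 BP.
DE   Example gene, complete cds.
OS   Homo sapiens
FT   source          1..30
FT   CDS             1..30
//
"""


def group_by_line_code(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-record marker
            break
        code, content = line[:2], line[5:]   # code in columns 1-2, content from column 6
        fields[code].append(content)
    return dict(fields)


record = group_by_line_code(RAW)
print(record["DE"])
print(record["OS"])
print(len(record["FT"]), "feature lines")
```

    Once the lines are grouped like this, restricting a search to the description or to the organism becomes a simple lookup.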

    For now, go back to the nice version Mrsselectentryview.png.
    We have already introduced cross-references ('db_xref'): in the "nice" version most of them are displayed as hyperlinks, taking you directly to the linked database. For example, look for the coding sequence of this entry, "af410477". What is a coding sequence? In bioinformatics, it is an annotation of a sequence, right? Hence you can find it in the features (FT) section, more specifically under the feature called CDS. For the CDS feature, some cross-references are present, after the qualifier db_xref.
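
    As a rough illustration of how such cross-references can be picked up by a program, the Python sketch below extracts the db_xref qualifiers from a few feature-table lines. The lines are hand-written and the database identifiers are placeholders, not the actual cross-references of AF410477.

```python
import re

# Hand-written feature-table lines in EMBL style; the identifiers are placeholders.
FT_LINES = """\
FT   CDS             1..30
FT                   /gene="example"
FT                   /db_xref="UniProtKB/TrEMBL:P00000"
FT                   /db_xref="GOA:P00000"
"""

# A db_xref qualifier has the form /db_xref="<database>:<identifier>".
DB_XREF = re.compile(r'/db_xref="([^":]+):([^"]+)"')


def extract_db_xrefs(text):
    """Return (database, identifier) pairs from /db_xref qualifiers."""
    return DB_XREF.findall(text)


for database, identifier in extract_db_xrefs(FT_LINES):
    print(database, identifier)
```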


    MRS: performing a more refined search

    The advanced search depends on the indexes of the entries. Not all information in an entry is indexed: which fields are indexed can be seen on the database summary page. Indexing is a commonly used technique to speed up database searches and to fine-tune your search. Let's give an example.

    I want you to get all the proteins of the Porcine Adenovirus. Let's try the naive approach first! Type 'Porcine Adenovirus' in MRS, using UniProtKB. You will see that this does not produce the right results! You cannot rely on them: they contain too much redundancy.

    So we need a better approach. Let's first set up our strategy. What are all the proteins of the Porcine Adenovirus? They are the gene products of all the genes on the genome of Porcine adenovirus. So can we make use of the annotated genome sequence? Yes! From the genome annotations, we can retrieve the CDS annotations to finally obtain all proteins. What we do:

    • First, we need the accession number of the genome.
    • Next, we exploit the fact that databanks contain cross-references to each other, so that we can search for all UniProt entries (UniProt = the major protein database) that reference the EMBL entry from step 1 containing the genome. In this way, we retrieve all protein sequences from porcine adenovirus.


    Step 1:

    - We search for the complete genome of porcine adenovirus, making use of the appropriate indexes.
    A genome sequence is a DNA sequence: hence we look for an EMBL record. Select "EMBL" with the Search selector.
    The index for organism is OS (for 'organism species'). Tell MRS that the organism needs to be porcine adenovirus by typing "os:porcine os:adenovirus". To retrieve the complete genome, we search the description (index "DE") for the words "complete genome" by typing "de:complete de:genome".
    How do we know about "OS" and "DE"? You can look up in MRS which fields are indexed per database. You can also find there which information is displayed on each line. Use it to your advantage: check the os and de fields for EMBL!
    Search! You will probably get 4 entries, corresponding to 4 different strains.


    Mrssearch poradvir result.png

     MRS combines search terms with AND by default. You can explicitly use AND OR NOT, or use the symbols & | !
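
    Typing such fielded queries by hand is error-prone, so a tiny helper can assemble them. The Python sketch below only builds the query string shown in Step 1; the function names are our own and it does not contact MRS.

```python
def fielded_terms(index, words):
    """Prefix every word with an index name, e.g. os:porcine os:adenovirus."""
    return " ".join(f"{index}:{word}" for word in words.split())


def build_query(**fields):
    """Join fielded terms with spaces; adjacent terms are combined with AND by default."""
    return " ".join(fielded_terms(index, words) for index, words in fields.items())


query = build_query(os="porcine adenovirus", de="complete genome")
print(query)  # os:porcine os:adenovirus de:complete de:genome
```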

    Let's continue with ab026117. You can convince yourself that the words "porcine" and "adenovirus" and "complete" and "genome" can indeed be found in the "OS" and "DE" tagged fields. This genome contains coding sequences (CDS) for 16 proteins. Remember, we only needed the accession number of the genome: so take 'ab026117' with you to the next step.

    Step 2

    - Fetch all protein sequences that are cross-referenced to the porcine adenovirus genome.
    Nice trick! Simply search for the accession number of the genome (ab026117) in the UniProtKB database. The search will find it in the db_xref fields.
    We get a list of the 16 proteins encoded by this genome.


    Mrssearch poradvir resultprot.png
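
    The logic of Step 2 amounts to a lookup through cross-references. The Python sketch below shows it on three invented protein entries: only the genome accession AB026117 comes from the exercise, while the protein identifiers and the second accession are placeholders.

```python
# Toy protein entries; identifiers and the accession ZZ999999 are placeholders.
PROTEINS = [
    {"id": "PROT_A", "xrefs": {"EMBL": ["AB026117"]}},
    {"id": "PROT_B", "xrefs": {"EMBL": ["AB026117", "ZZ999999"]}},
    {"id": "PROT_C", "xrefs": {"EMBL": ["ZZ999999"]}},
]


def proteins_for_genome(proteins, accession):
    """Return the ids of all proteins that cross-reference the given EMBL accession."""
    accession = accession.upper()   # accept ab026117 as well as AB026117
    return [p["id"] for p in proteins
            if accession in p["xrefs"].get("EMBL", [])]


print(proteins_for_genome(PROTEINS, "ab026117"))
```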


    Perhaps you can solve the next question, using the same techniques, by yourself. Remember: decide which database to use, which sequence you want, which indexes you need.



    For your gene of interest ...

    For your gene of interest, can you:

    • find the refseq nucleotide sequence?
    • find all the primary sequence information?
    • if it is protein coding, retrieve the 50% cluster of protein sequences at Uniprot?

    Conclusions of these exercises

    This wraps up our introduction to searching and downloading sequences from sequence databases.

    1. Information is stored in a structured way in databases,
    2. which allows rapid searching in databases.
    3. Knowing the structure is needed for an optimal search strategy.

    MRS gives you the power to query many different databases from a single access point. Carefully constructing and typing the query can be cumbersome; graphical interfaces such as Entrez or the ENA search therefore provide a user-friendly portal for searching.

    Optional exercises

    Extracting sequence parts corresponding to annotations

    Annotations are very useful information. Sometimes you might want to extract the sequences corresponding to annotations. This can, for example, be done with SMS2, the Sequence Manipulation Suite, which is accessible through our website: BITS software --> Hosted Apps --> SMS2 (mirror), or directly by clicking here. As an example, the sequence with accession number NM_010111 contains a lot of annotations, of which we are interested in the Sequence Tagged Sites (STS). We want to extract them in FASTA format.

    • Download NM_010111 in GenBank format
    • Use SMS2 --> GenBank Feature Extractor to retrieve all STS sequences (paste the sequence in the box and click on submit).
    • Copy/paste the sequences starting with STS (use search function of your browser).
    Sms2example.png
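
    What such a feature extractor does can be sketched in a few lines of Python. This is not the SMS2 implementation: the miniature record below is hand-written (it is not the real NM_010111 entry), and the sketch only understands simple "start..end" locations, not joins or complements.

```python
import re

# Hand-written miniature record in GenBank style; names and sequence are placeholders.
GENBANK = """\
LOCUS       EXAMPLE                   40 bp    mRNA    linear
FEATURES             Location/Qualifiers
     source          1..40
     STS             5..14
                     /standard_name="example-sts-1"
     CDS             3..38
     STS             21..30
                     /standard_name="example-sts-2"
ORIGIN
        1 acgtacgtac ggttccaagg ttaaccggtt aaccggatcc
//
"""

# A feature line starts with exactly five spaces, then the key, then the location.
FEATURE = re.compile(r"^ {5}(\S+)\s+(\d+)\.\.(\d+)\s*$")


def read_sequence(text):
    """Concatenate the sequence blocks that follow the ORIGIN line."""
    origin = text.split("ORIGIN", 1)[1].split("//", 1)[0]
    return "".join(re.findall(r"[a-zA-Z]+", origin))


def extract_features(text, key):
    """Yield (start, end, subsequence) for simple start..end features of one type."""
    sequence = read_sequence(text)
    for line in text.splitlines():
        match = FEATURE.match(line)
        if match and match.group(1) == key:
            start, end = int(match.group(2)), int(match.group(3))
            # Feature locations are 1-based and inclusive.
            yield start, end, sequence[start - 1:end]


for start, end, subseq in extract_features(GENBANK, "STS"):
    print(f">STS_{start}_{end}")
    print(subseq)
```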

    Changing sequence file formats

    We have seen many sequence file formats, in which metadata is documented in different ways. Not all tools accept all formats, which sometimes causes a lot of trouble. SMS2 can convert your sequences between FASTA, GenBank and EMBL formats.
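
    A format conversion boils down to reading the metadata and the sequence from one layout and writing them in another. The Python sketch below converts a hand-written, minimal EMBL-style record to FASTA. It is an illustration under simplifying assumptions (one record, only the ID, DE and SQ lines are used) and not how SMS2 does it.

```python
import re

# Hand-written miniature record in EMBL style; identifier and sequence are placeholders.
EMBL = """\
ID   EXAMPLE1; SV 1; linear; mRNA; STD; HUM; 20 BP.
DE   Example gene, complete cds.
SQ   Sequence 20 BP; 5 A; 5 C; 5 G; 5 T; 0 other;
     acgtacgtac gtacgtacgt                                                    20
//
"""


def embl_to_fasta(text, width=60):
    """Convert a single EMBL-style record to a FASTA-formatted string."""
    identifier, description, residues = "", [], []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):            # end of record
            break
        if line.startswith("ID"):
            identifier = line[5:].split(";")[0].strip()
        elif line.startswith("DE"):
            description.append(line[5:].strip())
        elif line.startswith("SQ"):           # sequence lines follow this header
            in_sequence = True
        elif in_sequence:
            # Keep the letters only; this drops spaces and the trailing position count.
            residues.extend(re.findall(r"[A-Za-z]+", line))
    sequence = "".join(residues)
    header = f">{identifier} {' '.join(description)}".rstrip()
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + body) + "\n"


print(embl_to_fasta(EMBL), end="")
```

    Note how all annotation except the identifier and the description is lost: FASTA simply has no place to store it.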

    Overview of the most important databases in MRS

    Since MRS is a metasearch engine over different databases, it offers a nice feature: if you want more information about a database currently accessible through MRS, you can click the button at the top of the page showing "Databank: uniprot" or Mrs embldb button.png, ... Or you can go to the complete overview by clicking on Mrs statusbutton.png in the top bar.


    An example of a specialized sequence databank: search for an old vector sequence that you cannot easily find elsewhere

    The sequence of the expression vector pMAL-c has never been submitted to EMBL/GenBank/DDBJ. You can however find it in the Intelligenetics Vector databank, which has been composed from various sources. It has not been updated since 1996 but is still useful for finding old vector sequences that you can find nowhere else. The databank can be downloaded from the NCBI anonymous ftp server. There are also a few sites where you can search it online: the MRS server of the Belgian EMBnet Node (which will however soon disappear), the SRS server of the DKFZ (German Cancer Research Center) and the Web site of Stanford University.

    Go to http://genome-www.stanford.edu/vectordb and follow the links "Plasmid Vectors" and "Complete Sequences Starting PL-PS". You will note, however, that the vector we are looking for is not on the page, presumably because of human error. We can still reach our goal by an unexpected route, exploiting the fact that the Google Web crawler reads and indexes the pages on the Stanford Web site! Go to http://www.google.be, follow the link "Advanced search", type pmalc in the "all these words" box, type genome-www.stanford.edu in the "Search within a site or domain" box and click on "Advanced Search".

    Hop, you should have found an entry with entry name PMALC and accession number IG1997. Note that there are some modern vectors that you can find nowhere except maybe in the release notes of the company that makes them.

    Google seqsearch.png


    Taxonomies

    Taxonomy is an NCBI database. It is the one and only reference for species classification of all sequences. You can search this database via the Entrez portal http://www.ncbi.nlm.nih.gov/Entrez. Search for "hiv" and click through until you get to a Taxonomy record. Every species and every node in the taxonomic lineage receives a Taxonomy ID. In bioinformatics terms, a species is referred to by this ID. From the Taxonomy database, you can retrieve sequences per species, which is sometimes very convenient.

    Exercise 1: Arabidopsis mitochondrial genome

    Exercise 2: mouse genome sequences

    Find taxonomy info for mouse using Entrez.




    The Rebase databank

    Rebase is a databank with information about restriction enzymes, maintained by R. Roberts at New England Biolabs, a company that sells, among other things, restriction enzymes. Suppose you need for one of your experiments an enzyme specific for the sequence CCCGGG. Go to http://rebase.neb.com/rebase/. Make sure the "choose search category..." selector stands on "recognition sequence", then type cccggg in the box underneath and click "Go".

    Rebase select.png

    You should obtain a list with more than 20 restriction enzymes, as well as some methyltransferases. Look at one of the restriction enzymes. Note the "Prototype" field and follow the hyperlink to SmaI. Of all enzymes specific for CCCGGG, SmaI is the most readily available or otherwise the "most representative". Note that SmaI makes blunt ends; some other enzymes in the list cut at a different position, or the exact location of the cut is not known. If you click on "Similar enzymes" you get a complete list.

    Rebase result.png
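
    To see what cutting at CCCGGG means for a sequence, here is a minimal Python sketch of a digest on one strand of a linear molecule. The input sequence is invented, the function name is our own, and the default cut position (in the middle of the site, as for the blunt-cutting SmaI) is passed as a parameter.

```python
def digest(sequence, site="CCCGGG", cut_offset=3):
    """Cut a linear sequence at every occurrence of site, cut_offset bases into the site."""
    sequence = sequence.upper()
    fragments, previous = [], 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset          # position of the cut within the sequence
        fragments.append(sequence[previous:cut])
        previous = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[previous:])    # remainder after the last cut
    return fragments


# Invented test sequence containing the recognition site twice.
print(digest("aattcccgggttaacccgggaa"))
```

    An enzyme that recognises the same site but cuts at a different position can be modelled by changing cut_offset. The sketch only tracks one strand, so it does not show the overhangs that such an enzyme would leave.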

    You can obtain a list of the companies where you can order SmaI by following the "Commercially Available" link in the SmaI page or the link below "Suppliers" in the Similar Restriction Enzymes page.
