DNA sequence analysis

Go to parent Basic bioinformatics concepts, databases and tools#Exercises_during_the_training

Gene regulation via Ensembl

At lot of sequence analysis has already been done for you in specialized sequence databases such as Ensembl. Go to the region from bp 52,000,000 to 52,200,000 on human chromosome 4.

Zoom in on the SGCB transcript, including a bit of flanking sequence on both sides.
In the middle graphic select the Drag/Select button and draw with your mouse a box around the SGCB transcript. If you don't see the button, it means that Drag/Select has already been activated.
Click Jump to region in the pop-up menu

CpG islands are genomic regions that contain a high frequency of CG dinucleotides and are often located near the promoters of mammalian genes. There are tools that predict potential CpG islands based on nucleotide composition. These predictions can be visualized in Ensembl.

Is there a CpG island located at the 5’ end of the SGCB transcript ?
First of all, locate the 5' end of the SGCB transcript: in the bottom graphic you see that SGCB is located under the dark blue line representing the genome sequence (contig). That means that the gene is located on the reverse strand. Click Configure this page in the side menu Type cpg in the Find a track text box Select CpG islands Close the configure page When you scroll down at the most detailed display, you see that there is indeed a CpG island located at the 5’ end of the SGCB transcript.

Possible transcription start sites can be predicted using the Eponine tool. Eponine predictions are one of the tracks available in Ensembl.

Is there a transcription start site predicted by Eponine annotated for the SGCB transcript?
Click Configure this page in the side menu Type eponine in the Find a track text box Select TSS (Eponine) Close the configure page When you scroll down at the most detailed display, you see that Eponine indeed predicts a transcription start site almost exactly at the same location where the Ensembl SGCB transcript starts.

An important mechanism of gene regulation is the accessibility of the DNA. For human and mouse Ensembl provides several tracks for visualizing the status of the chromatin, based on data from the ENCODE project.

This project aims to experimentally characterize all functional elements in the human and mouse genome: TFBS, TSS, open chromatin regions, non-protein-coding transcripts... The project obtains this information among others from large scale ChIP-Seq or DNA-Seq experiments like the sequencing of DNA regions that were cut by DNase to identify closed and open chromatin regions in the DNA.

Is the chromatin in the vicinity of the SCGB gene open or closed in H1ESC cells ?
Click Configure this page in the side menu Scroll to the Regulation section in the left menu and click Open chromatin & TFBS This opens a matrix with proteins (TFs and DNase) in the rows and cell types in the columns. Select DNase as a protein and H1ESC as cell type. These cells are embryonic stem cells (see overview of cell types used in ENCODE). Close the configure page When you scroll down at the most detailed display, you see that the DNase method identifies an open chromatin region that overlaps with the CpG island and the predicted TSS.

Similarly you can also define TSS as the regions that bind to the RNA polymerase. Encode performed ChIP-Seq experiments on various cell type samples to identify DNA regions that bind to RNA polymerase.

Does the experimentally determined TSS correspond to the one that was predicted by Eponine ?
Click Configure this page in the side menu Scroll to the Regulation section in the left menu and click Histones and polymerases This opens again a matrix with proteins (polymerases and modified histone proteins) in the rows and cell types in the columns. You can choose between Pol II and Pol III. In eukaryotes, Pol III transcribes DNA to synthesize ribosomal 5S rRNA, tRNA and other small RNAs while Pol II catalyzes the transcription of DNA into mRNA, most snRNA and microRNA. We are interested in the generation of mRNA so we select PolII as a protein and H1ESC as cell type. Close the configure page When you scroll down at the most detailed display, you see that the fragment that binds to PolII according to the ChIP-Seq experiment indeed overlaps with the predicted TSS.

Finally, DNA accessibility can also be regulated by modification of histone proteins. DNA winds itself around these proteins and as such they regulate the accessibility. If the DNA is tightly wound aroound them, it is not accessible. If the DNA loosens from the histones it becomes accessible. how well the DNA can bind to the histones is determined by the structure of the histones: i.e. which modifications are on the surface of the histone proteins ? Methyl groups, acyl groups or other modifications all influence the binding of the DNA .

ENCODE performed ChIP-Seq experiments with antibodies that target modified histone proteins to identify the DNA regions that are bound to them.

Is the regulatory region at the 5' end of the SCGB transcript bound to methylated H3K4 ?
Click Configure this page in the side menu Scroll to the Regulation section in the left menu and click Histones and polymerases This opens the matrix with proteins (polymerases and modified histone proteins) in the rows and cell types in the columns. Select H3K4m2 (doubly methylated) and H3K4m3 (triple methylated) as proteins and H1ESC as cell type. You may also select the single methylated form, I just didn't do that in the example. Close the configure page When you scroll down at the most detailed display, you see that the region is bound by doubly and triple methylated H3 in embryonic stem cells. Methylation of the fourth amino acid residue (K) from the N-terminus of histone H3 is one of the most studied histone modifications, and with good reason: it’s tightly associated with the promoters of active genes. Like all lysine residues, H3K4 can be mono, di, or tri methylated. H3K4me3 is the strongest activator of expression.

Finally, you can also search for binding sites of specific TFs in Ensembl, again based on ENCODE data.

Which TFs have a binding site in the CpG island of the SGCB gene in embryonic stem cells ?
Click Configure this page in the side menu In the Regulation section click Open chromatin and TFBS Set Filter on Transcription Factor This track contains information from ENCODE ChIP-Seq experiments. As such it specifies the location of TFBS in the genome. You see a matrix with TFs in the rows and cell types in the columns. Hover your mouse over the H1ESC. A box will appear in which you can select to visualize all TFs for which data are available will on the track. Close the configure page You can see that TAF1, TAF7 and Yy1 have a binding site in the CpG island. TAF1 and TAF7 are subunits of TFIID, which binds to the TATA-box in the Transcription Start Region to position the RNA polymerase properly on the DNA. Yy1 represses transcription by directing histone modifying proteins to promoter regions.

RNA processing via miRWalk

MiRWalk is a database that contains information on miRNAs from Human, Mouse and Rat. For each miRNA, it contains predicted as well as experimentally validated binding sites in their target genes.

In a mouse study, you find that the expression of TIMP3 influences the development of atherosclerosis. A decreased expression of TIMP3 stimulates inflammation of the arterial endothelium and triggers atherosclerosis.

You have searched for TFBS for TFs that might be the cause of the TIMP3 repression but to no avail. Maybe the expression of TIMP3 is regulated via other mechanisms, like miRNA induced degradation.

For decades, scientists have assumed that miRNAs bind to sequences in the 3'UTR of target genes.

Is TIMP3 expression regulated by binding of miRNAs in 3'UTR ?
Go to the MiRWalk website. Select Predicted targets in the top menu. Select Gene targets in the drop down list. In the first section, set Species to Mouse and paste the name of the gene. In the Gene region section select the region of the gene to search in, by default this is set to 3'UTR In the Options section, you select the minimum length of the seeds used for starting the alignment (similar to what happens in BLAST) and the confidence level you want to work at (default p-value = 0.05: you allow 5% false positive predictions) In the prediction programs section choose at least two algorithms to compare their predictions. Click the SEARCH button. The results page contains a list with links for the gene to retrieve data including genomic location, synonyms, Refseq IDs, homology information, external links, information on protein families, functional annotation... a list with potential miRNA binding sites within the sequences of the regions you searched in. If you searched in a region, you can click its link: Promoter, 5'-UTR, CDS and 3'-UTR. When you click the results for the 3'UTR, you indeed get a list of miRNAs that in theory could bind to the 3'UTR of TIMP3 .

You spend a lot of time checking if these miRNA are the cause of the decreased TIMP3 expression in atherosclerosis but you find no link at all. Quite frustrated you decide to perform a final analysis with an open mind. Although everybody states that miRNAs target 3'UTRs, you decide to scan the CDS for potential targets.

Do you find miRNA targets in the CDS of TIMP3 ?
Go to the MiRWalk website. Select Predicted targets in the top menu. Select Gene targets in the drop down list. Fill in the form as follows and click SEARCH: miRWalk allows you to search in any component of a gene: 3'UTR, CDS, 5'UTR and promoter. You get a list of miRNAs that in theory could bind to the CDS of TIMP3, one of them is mir-172 (red) .

You go back to the lab and find that atherosclerosis activates mir-172, which in its turn decreases the expression of TIMP3, thereby worsening the atherosclerosis.

It is important to search miRNA binding sites within the complete sequence (promoter, 5'-UTR, CDS and 3'-UTR) of a gene !

Does mir-172 target other CDS ?
For this we need to do the inverse analysis: start from a miRNA and predict which genes it targets. Fortunately, miRWalk can perform both analyses. Go to the MiRWalk website. Select Predicted targets in the top menu. Select miRNA targets in the drop down list. The miRNA-target search page is organized similar to the Gene-target search page. You select a species, provide an ID of the miRNA, select the regions you want to search in, set the options, select the algorithms you want to use for the predictions and click SEARCH. You get a list of CDS that can bind to mir-172, one of them is Pla2g6 (red). You go and search for a link between this gene and atherosclerosis in Pubmed and bingo: 10 publications containing the name of this protein and atherosclerosis.

Find regulatory motifs using MotifLab

We have identified a number of genes that respond to DNA damage and stress, initiating cell-cycle arrest or triggering apoptosis. Download a list with the Ensembl IDs of the potential targets. The first thing you need to do is creating a new protocol (=analysis). I want to know which TFs might be responsible for the observed upregulation of these genes in response to stress.

How to create a new protocol ?
Click the New Protocol button in the top menu

If you want to repeat your analysis later, you can document everything you do now via the record function.

How to record everything you do ?
Click the Record button in the top menu This will record your actions so it becomes easy to repeat the analysis on the same or other data.

The first thing you have to do is to import the sequences of the promoters you want to analyze.

How to import sequences ?
You can import promoter sequences by just providing Gene IDs or Ensembl IDs. MotifLab is coupled to Ensembl and will fetch the appropriate promoter sequences automatically. Click the Add Sequences button in the top menu Paste the Ensembl IDs or import the file containing the IDs (red). Note that you can choose the exact sequence of the promoters relative to the transcription start or end site (green). Click OK. In the Features section of the left menu you now see an icon representing your set of promoter sequences (red). Clicking the Visualization tab (green) will visualize the sequences.

Next you need one or multiple motifs. Jaspar and the public part of Transfac are by default in MotifLab but you can add your own or other public motif collections.

How to import motifs ?
In the Motifs section of the left menu click the +sign and select MotifCollection. As stated previously MotifLab has a number of predefined motif collections. Choose Jaspar Core and click OK. MotifLab has now loaded all Jaspar core motifs as you see in the Motifs section of the left menu. We only want to use motifs from human. To do this Deselect Jaspar Core in the left menu (red) Add a new Motif Collection Open the Manual Selection tab (green) In the Filter section set Organisms = human and press Enter (blue) Select all filtered motifs (purple) MotifLab creates a new motif collection called MotifCollection2 with only the 79 human motifs. Deselect Jaspar Core, deselect MotifCollection 2 and select it again. Check if all motifs from MotifCollection2 are selected.

How to scan the sequences for potential matches to the motifs ?
Right click the DNA icon in the Features section Select Perform Operation, Motif, motifScanning Now appears a window where you can set the parameters of the search. The Source is the collection of sequences you want to search in: in our case DNA You have to provide the name of the track you want to store the results in The Method specifies the algorithm you use for the searches. MotifLab has a number of built-in algorithms but you can also add tools via Configure -> External tools. We will use the SimpleScanner. The MotifCollection defines the set of motifs we want to search for, in our case MotifCollection2 the set of human motifs from Jaspar (red). The Percentage defines the miminum percentage similarity between the motif and the sequence. Set it to 90 (green). MotifLab will scan the sequences for motifs and displays them as a separate track under the sequences. When you hover your mouse over them you get extra information If you want to arrange the overlapping motifs neatly under each other right click the binding sites track and select Expand track.

DNA sequences contain repeat regions that might interfere with the motif searches. It is recommended to remove repeat regions from your sequences. MotifLab allows you to do so.

Motif searches always generate an enormous amount of false positives (regions in the DNA sequence that are similar to the motif but do not actually bind to the TF). You can reduce the number of false positives by focussing on regions in the promoter that are conserved between species. This is called phylogenetic footprinting. It is assumed that regions that are functionally important (like TFBS) do not change during evolution because if they change, their function would be lost.

MotifLab allows you to add feature annotation tracks to display and use annotation of the sequences, like the presence of conserved regions, repeats, open chromatin regions...

We will use the following tracks: Conservation and RepeatMasker

How to add these feature annotation tracks?
Click the Add Feature Data Track button in the top menu This opens the Datatracks dialog. Here you can select the tracks you want to visualize. Tracks with red buttons are not available for the organism you are working on. Select the Conservation and RepeatMasker track by clicking them. You can select both of them simultaneously by holding the Ctrl key. The annotation is now visualized.

Now you can use the annotation to remove motifs that are located in repeat regions or in regions with low conservation.

How to filter motifs in repeat regions or regions with low conservation?
Right click the binding sites track and select Perform Operation -> Transform -> Filter This opens the Filter dialog. Set the condition used for the filtering as follows This will remove motifs in regions with low conservation. Click the + sign to add a second condition. Right the AND and select Add New Condition to Group. Set the second condition as follows Right the AND and select Change Operator to OR to remove motifs that are in low conserved regions OR that are located in repeats. You see that this greatly reduces the number of motifs that are shown in the track. The only motifs that are retained are those in conserved regions and outside repeat regions.

You can do the same with other annotations e.g. remove motifs in closed chromatin regions...

You should, however, when you do this for real, use a better algorithm than the Simple Scanner. You can for instance download and add Clover.

Finding enriched TF binding motifs in a list of genes via Pscan

We performed additional experiments to find additional genes that show the same expression pattern as the ones in the previous exercise, so we now have an extended list of coexpressed genes.

Using PScan can you find other TF binding motifs that are enriched in the promoters of our coexpressed genes ?
Go to Pscan Paste the RefSeq accession numbers of our target genes in the Insert Gene/Sequence ID list box Pscan accepts the following gene or transcript identifiers: RefSeq (for human, mouse, and drosophila, e.g. NM_000546) TAIR (e.g. AT1G08810) for Arabidopsis; SGD (e.g. YPL248C) for yeast. Next, you have to specify the source organism as well as the region you want to be analyzed (relative to the annotated TSS). Click Run! The output shows the ranking of the TFs selected according to the enrichment p-values of their PWM. You see that the TP53 motif is strongly enriched in the promoters of these genes. Additionally you also see enrichment of motifs that bind HFN1A, a tumor marker just like TP53. By clicking on a matrix name, you can open a dedicated page showing detailed results: matrix, logo, and links to the database entry as well as to the ID (PMID) of the PubMed entry describing the generation of the PWM. A simple graphic representation shows the average matching value of the matrix in the sequences in our list of TP53 targets (top) compared to the average matching value and standard deviation (in green) on a set of all promoters (same size of regions with respect to the TSS as selected) of the same organism (bottom). Sample statistics shows the statistics of the enrichment analysis: the p-value, the Bonferroni corrected p-value, mean and standard deviation of the matching score on the list of TP53 targets, and number of sequences in the list. These latter pieces of information can be useful to compare the results of different data sets by using the Compare with.. box next to the statistics.

The p-value should be read as: "If we take as many random genes from the same organism as in the input set, and we select their corresponding promoter regions, what is the probability of having the same score as obtained in the input set?"

However, Pscan checks 130 matrices (or 280 if you select Transfac) so 130 independent statistical tests are performed. This means you have to correct for this if you do not want to end up with a long list of false positives (the p-value implies that the TFBS is enriched while in fact it is not).

Pscan uses a Bonferroni correction, meaning that the typical significance threshold (0.05 or 0.01) is divided by the number of tests (in our case: 130). This means that a significance threshold of 0.05 becomes 0.0003 after Bonferroni correction. When you use this value as significance threshold, the score for HFN1a is not significant. However, it should be noted that if you do so many t-tests the Bonferroni correction is not the ideal choice: it's too stringent in these cases.

In the exercises above we searched for known TF binding motifs but often you do not know the motif that regulates the expression of your genes of interest and you have to search for unknown motifs.
There are several tools to solve the complex problem of de novo identification of motifs that are enriched in a list of sequences. The wikipedia page has a very nice overview on these tools (see [1]).

What you need as an input is a list of promoters of which you are more or less certain that they all bind to the same TF, you just don't know which TF. If you have such promoter sequences you can search for common motifs that occur in all promoters.

Besides the very popular MEME suite, you can also use one of the RSAT tools for this.

You have to do the following steps to do the analysis:

Get promoter sequences using the RSAT tool retrieve Ensembl seq or retrieve sequence
Analyse them for common motifs using one of the RSAT tools under Motif discovery

EMBOSS: translation of nucleic acid into protein

The majority of protein sequences are derived from annotated nucleic acid sequences using a translation tool.

Go to the EMBOSS home page
Search the page for translation (using Ctrl + F)
In the NUCLEIC TRANSLATION section in the left menu, click transeq

Translate the CDS of the mouse basic domain/leucine zipper transcription factor L36435.
Go to the NCBI home page Search sequence L36435 Send the Coding sequences to a FASTA Nucleotide file Store the file under an informative name in a location on your computer where you can find it back Upload the file in the Transeq input form Click Submit If you did everything right you should obtain a protein sequence starting with a methionine (M) and ending with a *, which represents a stop codon.

Translate the CDS of the unknown human sequence that you can download here.
In this sequence you don't know where the CDS is located but you do know it's there. Fortunately you can use transeq to find a CDS without knowing in which frame you should look. To do this you can select 6 (All six fames) in the Parameters section of the tool to obtain all reading frames.

Of course, this doesn't work when you start from a genomic sequence of a gene that contains introns. Translation tools do not take into account the presence of introns.

A good alternative for the EMBOSS Transeq tool is the Expasy Translate tool.

GENSCAN to predict eukaryotic genes

Translating DNA (see previous exercise) and gene prediction are two very different things.
The translation software will translate any DNA sequence that you feed it regardless of whether the DNA sequence really is a CDS or not. Gene prediction software on the other hand looks for sound evidence (base composition, similarity to known CDS, presence of motifs...) to predict the location of CDS in a DNA sequence.

The Genbank entry with accession number X02419 contains the sequence of the gene encoding the urokinase-type plasminogen activator. We will run gene prediction software on the sequence and see if the software manages to correctly find the CDS. GENSCAN is one of the most popular tools for gene searching. It is used by the genome annotation pipeline at EBI, while a slightly modified version (Gnomon) is used at NCBI.
GENSCAN models eukaryotic genes with HMMs (hidden markov models) for the:

coding parts of eukaryotic exons
TATA-box
CAP-site
polyA-site

The software offers these models for human (usable for all vertebrates), Arabidopsis thaliana and maize.

GENSCAN is available via the Mobyle portal. Go to the Mobyle home page.

In the left menu expand sequence
Expand nucleic
Expand gene_finding
Click genscan

In the Input section:

Paste the sequence in FASTA format (red)
In the Organism section, choose HumanIso (green)
Click the Run button (blue)

The Mobyle server will ask for your email address, provide it and click Ok.
The Mobyle server will ask you to validate your submission, do so and click Ok.
You can compare the result with the annotation in the Genbank file of the sequence: it has Genbank accession X02419.

Did GENSCAN predict the CDS correctly ?
You can choose to see the GENSCAN prediction full screen The Mobyle portal is not always working well so if it doesn't generate an output you can go to the genscan page at MIT. Copy the reformatted sequence from the Mobyle results page: and paste it in the GENSCAN page at MIT in the input sequence text box. Click Run Genscan. You get the same results as shown above for the Mobyle portal. The annotation of the Genbank file: You will see that GENSCAN correctly predicted the CDS.

GENSCAN also found a polyA-site in the correct location. Note that GENSCAN does not find non-coding exons like the first exon in the Genbank file.

Selecting primers for PCR

Designing regular PCR primers using Primer3

The RefSeq entry NM_079400 contains the sequence of the D. melanogaster mRNA coding for tap, the target of Poxn. Tap encodes a bHLH protein expressed in larval chemosensory organs and involved in the response to sugar and salt. We wish to amplify by PCR the region coding for the HLH motif.

Find out which part of the sequence corresponds to the Helix-loop-helix domain.
Go to the Genbank record and click on the Gene link or go directly to the Gene record via GQuery. In the NCBI Reference Sequences section of the Gene record you can find the location of the domain: 155-211. This location applies to the protein sequence so it's expressed in amino acids. This means you have to calculate the corresponding positions in the nucleotide sequence: amino acids are encoded by codons of 3 nucleotides so you have to multiply the amino acid positions by three: 155 * 3 = 465 and 211 * 3 = 633 the sequence of the Genbank record contains more than the CDS: the CDS starts at position 112 (green box on figure) meaning that the start codon is located at position 112 in the sequence of the Genbank record. So you have to add 112 to the nucleotide positions: 465 + 112 = 577 and 633 + 112 = 745 In the nucleotide sequence of the RefSeq mRNA record, the domain is located between position +577 and +745.

The Primer3 program can help to select suitable primers.

Go to Primer3 ^[1] or use the nicer interface at Primer3Plus ^[2] (in the solution below Primer3Plus was used).

Note the many configurable parameters of Primer3!

Use Primer3 to select primers to amplify the HLH motif of tap.
Copy the full sequence from the Genbank record and paste in the sequence input box (green). Type in the Targets box 577,168 (blue) for targeting the hlh domain. You see that you can very specifically define the location of the primers: Excluded regions define the region in the sequence that is discarded for primer selection: primers will not lie in this region. Targets defines the region that you want to amplify: primers must surround this region closely. Included region defines the region that should be used to search for primers: primers must lie in this region. Go to the General Settings tab (red) and set Maximum Tm Difference to 2.0 Note the Product Size Ranges box: given these settings Primer3 will first try to find primer pairs that amplify a product with length between 500 and 600 and if this is not possible it will go for a larger product etc... This is, however, not what we want: we want to amplify the hlh region only so we aim for a much smaller product. Since the hlh region is 168 bp long we set this as the lower limit for the product size range and we'll take 300 bp as the upper limit. Click Pick Primers

In the results page you can see the following:
The program has selected 5 primer pairs, with the best at the top.
For each primer you can find:

the location on the sequence where the first base of the primer binds (for the forward/left primer this is on the complementary strand)
its length
its melting temperature
its %GC
the score of an alignment of the primer with the other primer and with itself

The program first considers all possible primers and retains those that satisfy a series of criteria, then considers all possible pairs of these primers and retains those who satisfy a series of criteria. At the bottom of the page you can see information about how many primers were considered and rejected/retained.

Check the specificity of the first set of primers.
Click the Send to Primer3Manager button (red) for the first primer pair. This leads to a form for further evaluating the primers: Click the BLAST! link to check the specificity of the primers. The primers are sent to a blastn form allowing you to iniate a BLAST search to find potential targets of the primers. By default the nr database is selected. Of course you are only interested in targets in the organism you are working with. Hits in other organisms will not influence your PCR. So Set the Organism to Drosophila melanogaster (remember that the sequence we want to amplify comes from fruitfly). Click the BLAST button.

Look at the 8 highest scoring hits for the forward primer.

Is the forward primer really identical to 8 different locations of the Drosophila genome ?
No what you see here are 8 records that all contain the tap sequence (mRNA, chromosome, synthetic constructs...) Remarkably some of them are annotated as the biparous TF, but if you take a closer look you will see that tap and biparous are the same gene. This is the problem when working with the nr database which in contrast to whats it's name lets you believe is not non-redundant at all.

So it's always a better idea to Blast against Refseq sequences unless they are not available yet for the organism you work with or you have reasons to believe that they are not complete (i.e. they do not represent the full genome). However, for model organisms, full chromosome sequences are available so you can use Refseq for BLAST.

Repeat the Blast on RefSeq
you are going to do PCR on genomic DNA of Drosophila so you want your primers to specifically amplify the poxn domain from the genomic DNA. This means that you have to select refseq_genomic as the database to search in Set the Organism to Drosophila melanogaster Click the BLAST button Now you see that the redundancy has disappeared.

Go to the results of the reverse primer.

When evaluating the specificity of the primers, it is especially important to check that the primers are specific at the 3' end because that's the site where the polymerase will attach nucleotides.
So generally, it is recommended to not use primers that contain long identical stretches (>15 nt for primers of 20 nt long) to other regions in the genome apart from the one you want to amplify/measure, and certainly not if these stretches comprise the last nucleotide at the 3' end of the primer.
When you look at the alignment of the first hit on chromosome X of the reverse primer you see that this is a borderline case: an identical stretch of 15nt including the nucleotide at the 3' end !

Designing regular PCR primers using PrimerBLAST

It's a lot of work to design primers using Primer3/CLC Main Workbench especially if you discover during the BLAST that the suggested primers are not specific and you have to start all over again.

Therefore, we will now design primers using the PrimerBLAST tool ^[3] of NCBI. PrimerBLAST uses the Primer3 algorithm to find primers allowing you to set all the parameters related to effciency of the primers (e.g. secondary structure and dimer formation, Tm, difference in Tm...). On top of this, PrimerBLAST uses the BLAST algorithm to find specific primers, that will only bind to the intended target. It is the only tool that incorporates the BLAST in the primer search.

For some applications, it is vital to have specific primers. This is for instance the case when you use PCR for quantification (qPCR) or when you want to amplify a commonly occurring sequence from the genome.

As an example, we will design primers to amplify a region of RefSeq entry NM_079400, which represents the D. melanogaster mRNA coding for tap, the target of Poxn. Tap encodes a bHLH protein expressed in larval chemosensory organs and involved in the response to sugar and salt. We wish to amplify the region encoding the Helix-loop-helix domain by PCR. In the nucleotide sequence of the RefSeq record, the domain is located between position +577 and +745.

So, let's go to the PrimerBLAST page.

How to set the general primer parameters ?
Fill in the first two section of the BLAST form as follows: Note that you do not have to fetch the sequence, you can directly enter the RefSeq ID (red). To amplify the helix-loop-helix domain you need a primer at both sides of the domain. To accomplish this you need to define the regions in the sequence that you want to use for selecting primers in by defining a range for both primers as shown in green. Set the product size (blue) and the maximum difference in anneal temperatures of the primers of a pair (purple) to reasonable values (e.g. take the same values as in Primer3).

Remember that PrimerBLAST also allows you to set the parameters related to specificity of the primers.

How to set the parameters related to primer specificity ?
Fill in the first two section of the BLAST form as follows: Compare the primers to the Drosophila (green) genome (red). You see that you can easily define the specifity that you want to obtain (blue): you can set the minimum number of nucleotides that should differ between the primers and the rest of the Drosophila genome the minimum difference at the 3' end of the primers

Only primers that fulfill the specificity criteria that you have set will be shown.

How to set the parameters related to primer efficiency (hairpins, dimers, length...) ?
Expand the Advanced parameters section: This shows a set of advanced parameters concerning the primer specificity (Primer Pair Specificity Checking Parameters). I normally keep these at their default settings. If you want to make your search more stringent you can decrease the Blast word size. This is the minimum number of consecutive nucleotides that have to exactly match between the primer and a sequence from the database to consider it a hit. The lower you set it, the faster BLAST will consider a sequence from the database a hit. After that follow the advanced Primer parameters. These are the ones that we are interested in. At the top you have a set of general parameters that are quite self-explanatory. Then follow the parameters concerning the likelihood that the primers will form hairpins and dimers. By default Primer-BLAST uses secondary structure alignment to predict these characteristics but you can choose to use more advanced and accurate thermodynamic models. Calculations will be slow when you choose these models though. Basically, instead of looking at individual nucleotide matches you use the stability of dinucleotide matches, e.g. the stability of CT hybridized to GA is different from that of CA hybridized to GT. The scores for self and pair complementarity in the default secondary structure are local alignment scores: the scoring system gives 1.00 for complementary bases, -1.00 for a mismatch, and -2.00 for a gap. So a score of 8 corresponds to 8 consecutive matches or 9 aligned nucleotides with 8 matches and 1 mismatch...

Click the Get Primers button (purple) to start the search. After a few seconds the search returns 5 specific high quality primer pairs:

Knowing that PrimerBLAST uses the same algorithm to pick primers as Primer3 ^[4], PrimerBLAST will save you a lot of work and time when you have to design primers.

Designing primers for qPCR

Exercises on qPCR primer design can be found on our primer design wiki page.

References:

[1] ttp://frodo.wi.mit.edu/

[2] ttp://primer3plus.com/cgi-bin/dev/primer3plus.cgi

[3] ttp://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=BlastHome

[4] ttp://www.ncbi.nlm.nih.gov/tools/primer-blast/primerinfo.html

[1]

[2]

[3]

[4]

DNA sequence analysis

Contents

Gene regulation via Ensembl

RNA processing via miRWalk

Find regulatory motifs using MotifLab

Finding enriched TF binding motifs in a list of genes via Pscan

EMBOSS: translation of nucleic acid into protein

GENSCAN to predict eukaryotic genes

Selecting primers for PCR

Designing regular PCR primers using Primer3

Designing regular PCR primers using PrimerBLAST

Designing primers for qPCR

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox