Multiple sequence alignment


FAQ

Generating alignments

Exercise 1: toy example comparing several algorithms

Several multiple sequence alignment (MSA) algorithms exist, each with its own program and output format. Fortunately, you can also find tools to convert between MSA formats.

Let's discover how some of these programs behave, and which suit your needs best! As always, EBI provides a nice collection of web-based alignment tools but you can also make MSAs in Ugene. Some MSA tools allow manual editing of alignments, which will be discussed in the next exercise.

Open Ugene. We are first going to use different tools for the toy example below:

>Sequence1
GARFIELDTHELASTFATCAT
>Sequence2
GARFIELDTHEFASTCAT
>Sequence3
GARFIELDTHEVERYFASTCAT
>Sequence4
THEFATCAT
>Sequence5
GARFIELDTHEVASTCAT

You can download the sequences in fasta format. Do not open the text file in Ugene. The tools for multiple sequence alignment can be found in the Tools menu.
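If you want to check the toy sequences outside Ugene, the FASTA format is simple enough to read with a few lines of Python. The sketch below is only an illustration: the `parse_fasta` helper and the `TOY_FASTA` variable are names made up for this example, not part of Ugene.

```python
def parse_fasta(text):
    """Return a dict mapping sequence names to sequences."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Sequence lines are appended to the current record
            records[name] += line
    return records


TOY_FASTA = """>Sequence1
GARFIELDTHELASTFATCAT
>Sequence2
GARFIELDTHEFASTCAT
>Sequence3
GARFIELDTHEVERYFASTCAT
>Sequence4
THEFATCAT
>Sequence5
GARFIELDTHEVASTCAT
"""

toy = parse_fasta(TOY_FASTA)
for name, seq in toy.items():
    print(name, len(seq))
```

The sequences have different lengths, which is exactly why an alignment program has to insert gaps.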

Parameters of Clustal Omega.
Clustal Omega has the following parameters:

Number of iterations: the number of iterations used to improve the MSA.
Max number of guide tree iterations: Clustal Omega generates a guide tree (like other progressive MSA algorithms) and can then refine it by replacing the two most similar sequences with a model that represents their alignment and recalculating the guide tree with the model instead of the two separate sequences. This parameter defines how many times this step is performed to change (and improve) the guide tree.
Max number of HMM iterations: apart from refining the guide tree, Clustal Omega can also refine the MSA based on the guide tree by randomly removing one sequence from the MSA, making a (hidden Markov) model of the remaining alignment and realigning the removed sequence to the model. This parameter defines how many times the alignment is refined.

Iterations come at a cost: they increase the run time. Each iteration adds 1-3 times the time needed to make the alignment without iterations. That is why the default setting is not to use iterations. The sequences in this example are too short and simple to show the impact of iterations, so we are going to use the default settings.
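The same three iteration settings also exist when Clustal Omega is run from the command line. The sketch below only builds the argument list and does not run anything. It assumes a `clustalo` executable is installed, and the flag names (`--iter`, `--max-guidetree-iterations`, `--max-hmm-iterations`) are how they are recalled from the Clustal Omega help text, so check `clustalo --help` for your version. The function name `clustalo_command` is made up for this example.

```python
def clustalo_command(infile, outfile, iterations=0,
                     max_guidetree_iterations=None,
                     max_hmm_iterations=None):
    """Build (but do not run) a Clustal Omega command line."""
    cmd = ["clustalo", "-i", infile, "-o", outfile]
    if iterations > 0:
        # Combined guide tree / HMM iterations; the default is none
        cmd.append(f"--iter={iterations}")
    if max_guidetree_iterations is not None:
        cmd.append(f"--max-guidetree-iterations={max_guidetree_iterations}")
    if max_hmm_iterations is not None:
        cmd.append(f"--max-hmm-iterations={max_hmm_iterations}")
    return cmd


# Default settings, as used in this exercise
print(" ".join(clustalo_command("Garfield.txt", "GarfieldClustalO")))
# Two iterations, but only one round of HMM refinement
print(" ".join(clustalo_command("Garfield.txt", "GarfieldClustalO",
                                iterations=2, max_hmm_iterations=1)))
```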

Align these sequences with ClustalOmega using default parameters.
In the top menu select Tools
Select Multiple alignment
Select ClustalO. This opens the Clustal Omega parameters window
Specify the file Garfield.txt that you have downloaded as input file
Specify an output file, e.g. GarfieldClustalO
Click Align

This results in the following alignment:

Ugene colours the amino acids according to percentage identity. Residues that are conserved in many sequences are colored darker blue than residues that only occur in a few sequences.
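To make the idea of percentage identity concrete, the sketch below computes, for each column, the share of sequences that carry the most common residue. This is an assumption about how such a score can be defined, not a description of Ugene's exact calculation. The three aligned sequences are a made-up example, not the output of any of the tools used here.

```python
from collections import Counter


def column_identity(aligned):
    """Percentage of sequences sharing the most common residue, per column."""
    result = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            # A column of gaps only has no conserved residue
            result.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        result.append(100.0 * top_count / len(column))
    return result


# Hypothetical alignment, for illustration only
example = ["THEFAT-CAT",
           "THEFASTCAT",
           "THEVAST-AT"]

identity = column_identity(example)
for position, value in enumerate(identity, start=1):
    print(position, round(value, 1))
```

Columns with a value of 100 would get the darkest blue, columns with lower values a lighter shade.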

You can change the coloring scheme by clicking the Highlighting button in the right menu.

Color the alignment according to the Zappo scheme.
Change the coloring scheme by clicking the Highlighting button (red) in the right menu.

The Zappo scheme colors the sequences according to their biophysical properties:

negatively charged amino acids (D and E) are coloured red
positively charged amino acids (R, H and K) are coloured blue
amino acids with polar uncharged side chains (S, T, N and Q) are coloured green
amino acids with aromatic side chains (F, Y and W) are coloured orange
...
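The groups listed above can be written down as a small lookup table. The sketch below covers only the four groups named in this list; residues from the groups that are not listed here return `None`. The names `ZAPPO_LISTED` and `zappo_colour` are made up for this example.

```python
# Only the four groups described in the text; the scheme has more
ZAPPO_LISTED = {
    "red": "DE",       # negatively charged
    "blue": "RHK",     # positively charged
    "green": "STNQ",   # polar uncharged
    "orange": "FYW",   # aromatic
}

RESIDUE_TO_COLOUR = {
    aa: colour for colour, group in ZAPPO_LISTED.items() for aa in group
}


def zappo_colour(residue):
    """Return the colour of a residue, or None if its group is not listed."""
    return RESIDUE_TO_COLOUR.get(residue.upper())


colours = [zappo_colour(aa) for aa in "THEFATCAT"]
print(colours)
```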

At the top of the alignment you see the consensus sequence giving a one line representation of the alignment:

if each sequence contains the same amino acid on a position: the amino acid is printed in capitals
if the majority of the sequences contain the same amino acid on a position: the amino acid is printed in small letters
if all amino acids on a position are similar: a + is printed
if the amino acids on a position are different: a - is printed
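The four rules above can be sketched in a few lines of Python. Two things are assumptions here: which amino acids count as "similar" (the sketch reuses the four property groups listed for the Zappo scheme) and how gaps are treated (a gap never counts as a residue). Ugene may define both differently. The alignment is a made-up example.

```python
from collections import Counter

# Assumption: residues in the same property group count as similar
SIMILAR_GROUPS = [set("DE"), set("RHK"), set("STNQ"), set("FYW")]


def consensus_symbol(column):
    """Apply the four consensus rules to one alignment column."""
    residues = [c for c in column if c != "-"]
    if residues:
        top, count = Counter(residues).most_common(1)[0]
        if count == len(column):
            return top.upper()      # same amino acid in every sequence
        if count > len(column) / 2:
            return top.lower()      # same amino acid in the majority
    all_present = residues and len(residues) == len(column)
    if all_present and any(set(residues) <= g for g in SIMILAR_GROUPS):
        return "+"                  # all amino acids are similar
    return "-"                      # amino acids are different


def consensus(aligned):
    return "".join(consensus_symbol(col) for col in zip(*aligned))


# Hypothetical alignment, for illustration only
example = ["THEFAT-CAT",
           "THEFASTCAT",
           "THEVAST-AT"]
print(consensus(example))
```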

Above the consensus sequence you see the conservation scores that represent the similarity on each position: the higher the score the more similar the amino acids on that position are.

Now we will make the alignment using ClustalW2.

Parameters of ClustalW.
In contrast to Clustal Omega, which replaces amino acids by numbers to speed up pairwise similarity score calculations, ClustalW builds the guide tree based on real pairwise alignments. To make these pairwise alignments, a scoring system is used that you can define (red).

For proteins, a complex scoring system is used with specific gap penalties for regions that are rich in one (type of) amino acid: G-rich regions and hydrophilic regions, for example, are more likely to contain gaps, so gap penalties are reduced in such regions. You can also define a minimal distance between gaps (gap separation distance): gaps that are closer together than this distance receive an extra penalty, except when they are at the ends of the alignment (no end gap separation penalty).

By default, ClustalW performs progressive alignment (no iterations), but you can use iterations by changing the iteration type parameter (green). Choose TREE for full iterations (guide tree and MSA) or ALIGNMENT to refine only the MSA (and not the guide tree).

The out sequences order parameter (blue) determines the order of the sequences in the alignment: based on the guide tree (Aligned) or, by default, in the same order as in the input file (Input).
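To see how a scoring system with gap penalties turns two sequences into a pairwise alignment score, here is a minimal global alignment (Needleman-Wunsch) sketch. It is much simpler than what ClustalW does: it uses a plain match/mismatch score instead of a substitution matrix and one fixed penalty per gap position instead of position-specific penalties. The default values for `match`, `mismatch` and `gap` are arbitrary choices for this example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (score only, no traceback)."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]


# Two fragments of the toy sequences
print(global_alignment_score("THEFASTCAT", "THEFATCAT"))
print(global_alignment_score("THEFATCAT", "THEFATCAT"))
```

Making the gap penalty less negative makes gaps cheaper, which is the effect ClustalW applies locally in regions that are likely to contain gaps.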

Make the alignment using default parameters.
Using the default parameters, ClustalW2 returns the following alignment:

Although EBI advises using Clustal Omega (see the Please Note message on the ClustalW2 page), you can see that for this example the ClustalW2 alignment is better.

Parameters of MUSCLE.
Another increasingly popular alignment algorithm is Muscle. It has the following parameters: For most applications, you use the Muscle default mode. When you have thousands of sequences or when the sequences are very long, you can run Muscle with only 2 iterations (Large alignment mode). To refine an existing alignment use the Refine only mode (red). To keep the sequences in the same order as in the input file select Do not re-arrange sequences (green). You can define the speed of the analysis by setting the maximum number of iterations or the maximum time you want to spend on this analysis (green).

Align the Garfield sequences with MUSCLE.
Using the default settings, the following alignment is generated:

The MUSCLE alignment looks even better than the ClustalW2 alignment.

Another alignment algorithm is MAFFT, whose developers state that it is one of the most accurate multiple sequence alignment methods currently available.

Align the Garfield sequences with MAFFT using the default parameters.
MAFFT parameters refer to calculating similarity scores from pairwise alignments and the number of iterations you want to perform. It gives exactly the same alignment as Muscle.

From the examples above, we see that these four algorithms give three different alignments. We can see this easily with this small toy example, but when you want to align large sets of sequences it's not so easy to see which algorithm performs best.

If you have hundreds of sequences to align, you have to take processing speed into account. In this case, your best options would be MAFFT (best quality, a bit slower) and Clustal Omega (very fast, good quality).

The most popular algorithm is ClustalW, which makes use of the progressive alignment algorithm that was described in the slides but can add iterations for refinement. However, this alignment algorithm is slow compared to the other algorithms.

A lot of multiple sequence alignment programs exist. Make your selection of MSA programs based on:
1. what you have access to
2. the number of sequences
3. the type of sequence (DNA/protein)

Changing and editing alignments

Most of the time, you are not perfectly happy with an MSA that is generated by an MSA tool and you want to change the alignment yourself. You can use these free alignment editors:

MEGA (very powerful, for generating, visualizing and editing alignments)
SeaView
BioEdit

Or you can edit the alignment in Ugene.

Navigating the alignment

At the bottom of an alignment you see the overview, which shows the coverage of the complete alignment. Using the overview, you can see the regions of the alignment with many gaps and those without gaps. You can navigate to these parts of the alignment by clicking a region in the overview.

You can change the information shown by the overview. By right clicking the overview you can choose to show the "Simple" overview, which is a bird's-eye view of your alignment in the selected color scheme.

Alternatively, you can navigate the alignment by dragging the sliding window to move across the alignment in the main window.

Changing the colors

How to colour the alignment according to conservation using percentage identity as a measure?
In the Right menu:
Click the Highlighting tab to open the Options panel (red)
In the Color section select Percentage Identity (green)
Close the panel by clicking the Highlighting tab again
Now each position in the alignment is coloured according to this colour scheme: the darker the blue, the more conserved the position is.

Selecting a part of the alignment

How to select a part of an MSA?
Select a portion of the alignment in Ugene by dragging a rectangle over that part of the alignment
Right click the selection
Select Copy
Select Copy selection

Editing the alignment

Editing is done in two ways:

You delete divergent sequences from the alignment.
You also remove uninformative positions: these are positions that do not contain information on the evolutionary relation between the sequences. These positions do not contain phylogenetic information since you don't have a sequence for the other organisms there. The only thing it tells us is that these residues exist in one organism but not in the others. So you have to remove positions where you only have a sequence for one organism and not for the others.

<p>If you have sequences that clearly diverge a lot from the rest of the sequences, containing large regions that do not match the others, you should remove them from the alignment before you make a tree. When you are doing this for real (because you want to include a phylogenetic tree in your publication for instance), you should first try to find the reason why the sequences are so different from the rest. In many cases, it will be because of errors in the annotation e.g. an intron that was not correctly annotated. A wrongly annotated intron can have a major impact on the resulting protein sequence. So first check the genomic sequence before you actually remove sequences from the alignment. For time's sake, we skip this step and simply remove divergent sequences from the MSA.

How to remove divergent sequences from the alignment?
To remove the divergent sequences from the alignment:
select the sequence by clicking its name
right click the sequence name
select Edit in the drop down menu
select Remove current sequence

Smaller regions with differences can be tolerated.

How to remove uninformative positions from the alignment?
You can do this by selecting a subalignment (the part in the alignment you want to remove):
place your mouse cursor on the top of the alignment (where the positions in the alignment are displayed) where you want to start the selection
select the positions you want to remove by holding the right mouse button
release the mouse button when you have made the complete selection
Right click the selected subalignment
Select Edit
Select Remove selection

The position will disappear. You see that editing the alignment can be a lot of work, especially for proteins that are not very conserved. Fortunately, more and more tools for constructing phylogenetic trees will remove these positions automatically.</p>
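Removing positions where only one sequence has a residue is easy to automate. The sketch below keeps a column only if at least `min_residues` sequences have a residue there. Both the function name and the threshold parameter are made up for this example, and the alignment is a hypothetical one.

```python
def drop_uninformative_columns(aligned, min_residues=2):
    """Remove columns in which fewer than min_residues sequences have a residue."""
    keep = [
        i for i, column in enumerate(zip(*aligned))
        if sum(c != "-" for c in column) >= min_residues
    ]
    return ["".join(seq[i] for i in keep) for seq in aligned]


# Hypothetical alignment: only the second sequence has a residue in column 6
example = ["THEFA-TCAT",
           "THEFASTCAT",
           "THEVA-TCAT"]

for row in drop_uninformative_columns(example):
    print(row)
```

Dedicated trimming tools use more refined criteria, so treat this only as an illustration of the principle.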

In the same way you could remove positions where all sequences agree since they are also not informative for constructing a phylogenetic tree. However, in practice this is never done because an alignment without fully conserved positions looks strange.

It is always better to use multiple tools for constructing the MSA and to compare their results (we are not going to do this for the sake of time, but you should when you want to make a phylogenetic tree for your research).

After you have removed all divergent sequences and uninformative positions, you can use the alignment for phylogenetic tree construction. Save the project by clicking the Save All button in the top toolbar.

Constructing a phylogenetic tree

Ugene has three methods to calculate a tree:

one based on maximum likelihood
one based on neighbour joining
one based on Bayesian statistics

How to create a tree in Ugene?
In the top menu:
select Actions
select Tree
select Build Tree. This opens the Build Phylogenetic Tree window
Set the Tree building method to the algorithm you want to use.

The NJ method (Phylip) does not use a real model of evolution (only a score matrix). It simply calculates the distance between the sequences and assumes that these sequence distances reflect the genetic distances between the species.
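The simplest possible distance between two aligned sequences is the fraction of positions at which they differ (the p-distance). The sketch below computes a matrix of such distances. Note that this is a simplification: Phylip uses a score matrix to calculate distances, whereas here every difference counts equally. The alignment is a made-up example.

```python
def p_distance(a, b):
    """Fraction of differing positions, ignoring positions with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no shared ungapped positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    return [[p_distance(a, b) for b in aligned] for a in aligned]


# Hypothetical alignment, for illustration only
example = ["THEFATCAT",
           "THEFASCAT",
           "THEVASCAT"]

matrix = distance_matrix(example)
for row in matrix:
    print([round(d, 3) for d in row])
```

Neighbour joining then builds the tree from such a matrix by repeatedly joining the closest pair of sequences or clusters.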

Parameters of Phylip.
Distance matrix model: score matrix used to calculate distances between the sequences
Gamma distributed rates across sites: distances are corrected to take into account unequal rates of change at different sites. It is assumed that these evolution rates follow the gamma distribution
Coefficient of variation of substitution rate among sites: becomes available if Gamma distributed rates across sites is checked. Specifies the coefficient of variation of the gamma distribution
Transition/transversion ratio: expected ratio of transitions to transversions (only for nucleotide alignments)

To enable bootstrapping go to the Bootstrapping and Consensus Tree tab. The following bootstrapping parameters are available:

Number of replicates: the number of bootstrap replicates to run
Seed: random number. It is generated automatically, but you can set this value yourself to make the results of different runs reproducible
Consensus type: specifies the method to build the consensus tree
Strict: a set of species must appear in all bootstrap trees to be included in the consensus tree
Majority Rule (extended): any set of species that appears in more than 50% of the bootstrap trees is included in the consensus. The program then considers the other sets of species in order of decreasing frequency, adding them to the consensus tree if they are compatible with the tree, until the tree is fully resolved. This is the default setting
M1: includes in the consensus tree any sets of species that occur among the bootstrap trees more than a specified fraction of the time
Majority Rule: a set of species is included in the consensus tree if it is present in more than half of the bootstrap trees

The Display options tab specifies how to display the tree.
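The two ideas behind these settings can be sketched in Python: a bootstrap replicate is the alignment with its columns resampled with replacement (the seed makes this reproducible), and the plain Majority Rule keeps every set of species found in more than half of the bootstrap trees. This is a simplified sketch, not Phylip's implementation. Trees are represented here as sets of species groups, which is a made-up representation for this example, and all data are hypothetical.

```python
import random
from collections import Counter


def bootstrap_alignment(aligned, seed):
    """Resample alignment columns with replacement, reproducibly."""
    rng = random.Random(seed)
    n_cols = len(aligned[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in aligned]


def majority_rule_clades(trees):
    """Keep groups of species present in more than half of the trees."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade for clade, n in counts.items() if n > len(trees) / 2}


# Hypothetical alignment and bootstrap trees, for illustration only
example = ["THEFATCAT", "THEFASCAT", "THEVASCAT"]
replicate = bootstrap_alignment(example, seed=42)

trees = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB")},
    {frozenset("AC")},
]
supported = majority_rule_clades(trees)
print(replicate)
print(supported)
```

Running `bootstrap_alignment` twice with the same seed gives the same replicate, which is why fixing the seed makes runs comparable.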

PhyML and MrBayes use a model of evolution in their calculations.

Parameters of PhyML.
Substitution model: score matrix
Equilibrium frequencies:
Empirical: equilibrium frequencies of amino acids are estimated by counting the occurrence of the amino acids in the alignment
Optimized: for nucleotide sequences, optimizing nucleotide equilibrium frequencies means that the values of these parameters are estimated in the maximum likelihood framework. For protein sequences, the equilibrium amino acid frequencies are those defined by the substitution model
Transition/transversion ratio: expected ratio of transitions to transversions (only for nucleotide alignments). It can be fixed to its initial value (specified by the substitution model) or optimized during the maximum likelihood calculations
Proportion of invariable sites: the expected frequency of sites that do not evolve. It can be fixed (the default is fixed to 0) or estimated during the maximum likelihood calculations. The default assumes that each site in the sequence may accumulate substitutions at some point, even if no differences across sequences are actually observed at that site
Number of substitution rate categories: evolution rates vary from site to site. This can be modeled using a discrete gamma distribution. The categories of this discrete distribution correspond to different rates of evolution. The default is 4. It is not wise to go below 4. Larger values are preferred but will take more time. A reasonable number is 20
Gamma shape parameter: the shape of the gamma distribution that models the rate variation across sites. Small values correspond to large variability. The gamma shape parameter can be fixed by the user or estimated via maximum likelihood

On the Branch support tab you select the method that is used to measure branch support: a fast likelihood method or bootstrapping.

On the Tree searching tab you select the algorithm you are going to use to determine the shape (topology) of the tree:

Type of tree improvement: the default is NNI (nearest neighbour interchange). The second approach relies on subtree pruning and regrafting (SPR). It generally finds better tree topologies than NNI but is also significantly slower. The third approach, BEST, simply estimates the phylogeny using both methods and returns the better of the two solutions
Set number of random starting tree: when SPR or BEST is selected, it is possible to use a random tree as a starting tree. If this option is turned on, five trees, corresponding to five random starts, will be estimated. The output tree file will contain the best tree found among those five
Optimize topology: the shape of the tree is determined by maximising the likelihood (see slides). The default is yes. It is possible to set this to no when you want to compute the likelihood of a tree given as input

The Display options tab specifies how to display the tree.

Parameters of MrBayes.
The settings on the Model tab specify the model: the score matrix and the variation of mutation rates between positions. Phylip only calculates one tree, whereas PhyML and MrBayes calculate a representative set of all possible trees and then identify the best one. This means that the latter two methods are slow. To speed up calculations you can decrease the number of trees that are evaluated by setting Chain length (How many trees are assessed?) and Subsampling frequency (How many times do you calculate a score?) in the MCMC tab. Heated chain temp defines how much difference there should be between two consecutive trees. In the Display options tab select Display tree in new window.

How to change the width of the branches to get a nicer looking tree.
Red box on the figure:

The numbers on the tree reflect evolutionary distances, expressed in expected number of substitutions per position in the alignment. The number on a branch is based on the combined scores of all possible trees that contain that specific branch.

Improving the visualisation of the tree with PhyD3

Ugene automatically saves trees in Newick (.nwk) format. This format is accepted by PhyD3, so you can play with the tree in PhyD3. PhyD3 is a cool interactive tree viewer developed at VIB.

How to load the tree in PhyD3?
Click the Submit button
Click Browse to select the .nwk file from your computer
Click the Send button

The tree that is displayed is a phylogram: the length of the branches represents evolutionary distances. If you want to display a cladogram, deselect Show phylogram.
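In a Newick file the branch lengths are the numbers after the colons. The sketch below shows the difference between the two views on a made-up Newick string: removing the branch lengths leaves only the branching order, which is all a cladogram shows. The regular expression is a simplification that assumes plain names without quotes or colons, and `strip_branch_lengths` is a name made up for this example.

```python
import re


def strip_branch_lengths(newick):
    """Remove ':length' parts from a simple Newick string."""
    return re.sub(r":[0-9.eE+-]+", "", newick)


# Hypothetical tree, for illustration only
phylogram = "((A:0.1,B:0.2):0.05,C:0.3);"
cladogram = strip_branch_lengths(phylogram)
print(cladogram)
```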

Newick is a pretty basic format but if you use PhyloXML files you can include sequences, taxonomy and domain annotations. PhyD3 can incorporate this info in the phylogenetic tree image:

Sequence logos

A good way to visualise alignments with lower sequence similarity (like the one of the histone proteins) is by sequence logos. Good tools for generating sequence logos are:

iceLogo: a VIB tool that generates a logo by calculating frequencies of amino acids in each position of the MSA and comparing them to a reference set (typically the full proteome of an organism). In this way you can account for the fact that amino acids that occur frequently in the proteome will by nature also occur frequently in your MSA and are 'less relevant' than amino acids that occur rarely in the proteome.
WebLogo: does not use a reference set, so assumes all amino acids occur equally often in the genome
enoLOGOS (Energy NOrmalized logos): corrects for biases in amino acid distribution.

The Y-axis shows scores in bits; the X-axis shows the position in the alignment. Bits are calculated on a log2 scale. Since there are 20 amino acids, the maximum value is log2(20) = 4.32 bits. In a sequence logo the height of an amino acid reflects its frequency in the alignment and is expressed in bits.
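The bit values can be reproduced with a short calculation. In the sketch below, the information content of a position is log2(20) minus the entropy of the amino acid frequencies at that position, and each letter gets a share of that height proportional to its frequency. This follows the basic logo definition without a reference set and leaves out the small-sample correction that logo tools may apply, so the numbers can differ slightly from what a real tool draws. The columns are made-up examples.

```python
import math
from collections import Counter


def letter_heights(column, alphabet_size=20):
    """Height in bits of each amino acid at one alignment position."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return {}
    counts = Counter(residues)
    total = len(residues)
    # Shannon entropy of the observed amino acid frequencies
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    information = math.log2(alphabet_size) - entropy
    return {aa: (n / total) * information for aa, n in counts.items()}


# A fully conserved position reaches the maximum height
print(letter_heights("AAAA"))
# Two equally frequent amino acids share a lower total height
print(letter_heights("AACC"))
```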

How to create a logo with WebLogo
Paste the selection in the Multiple sequence alignment box in WebLogo, increase the height of the logo to 8 cm and click Create Logo:
