PubMA Exercise.6
Full analysis workflow using CLC main workbench
[ Main_Page | Hands-on Analysis of public microarray datasets | PubMA_Exercise.5 | PubMA_Exercise.6 |
| Analyze GEO data with the Affymetrix software ]
The following content was taken directly from the CLC documentation current at the time of writing (Feb 17, 2014) and applies only to people with access to the CLC Main Workbench
Contents
- 1 CLC Tutorial material
- 1.1 Obtaining public microarray data for the Workbench
- 1.2 Getting your own Affymetrix microarray data into the correct format for the Workbench
- 1.3 Loading data into the workbench
- 1.4 Setting up a microarray experiment in the Workbench
- 1.5 Log transformation of the data
- 1.6 Quality control of the data
- 1.7 Identification of DE genes
- 1.8 Main results and specific settings
- 1.9 Enrichment analysis in CLC Main
- 2 Additional Info
- 3 Published results
- 4 Download exercise files
CLC Tutorial material
Users from VIB labs, or anyone with a license for the CLC Main Workbench, can proceed with this exercise during the afternoon open session or later from home. The final CLC project folder can be downloaded from the BITS server ('Heart vs Diaphragm.zip') and imported into the user's CLC project manager.
Obtaining public microarray data for the Workbench
The Workbench supports analysis of one-color expression arrays. These may be imported from GEO. The Workbench supports the following formats:
- GEO SOFT sample-files: simple line-based, plain text files that contain all the data and the descriptive information of a microarray experiment (example SOFT-file)
- GEO series-file: txt-files containing the definitions of a group of related samples. They contain tables describing extracted data, summary conclusions and analyses. Each Series-file is assigned a unique GEO accession number (GSExxx).
See our tutorial on downloading data from GEO for a detailed description on how to obtain the required files for import into the Workbench. If the file is compressed, unzip it before you import it in the Workbench.
Getting your own Affymetrix microarray data into the correct format for the Workbench
The Workbench assumes that expression values are given at the gene (probe set) level, thus probe-level analysis of Affymetrix arrays and import of Affymetrix CEL- and CDF-files is not supported. However, you can import your own Affymetrix data via two ways:
- as .CHP files generated by the Affymetrix Expression Console, containing normalized Affymetrix data. See the section on how to convert .CEL files to .CHP files using the Expression Console for a detailed discussion on how to do this. Use RMA for the normalization!
- as .txt files exported from R, containing normalized Affymetrix data. The rest of this section shows how you can do the normalization in R.
To create these .txt files, open R or RStudio as administrator and install the following packages:
- Matrix
- lattice
- fdrtool
- rpart
Then run the following code:
# Install all required Bioconductor packages
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("affy")
You also need Bioconductor packages for the annotation of the spots on the arrays. In the example below I use the annotation packages for the Arabidopsis ATH1 array. Of course, you need the annotation packages that correspond to the arrays that you used in your experiment.
There are two possibilities for obtaining these packages:
- Bioconductor has a list of annotation packages (generated by Affymetrix). In this list, find the name of the cdf file that corresponds to the array that you have used (we used ATH1 arrays, so the cdf file is called ath1121501cdf):
To use this cdf file execute the following R-commands:
biocLite("ath1121501cdf")
# load Bioconductor libraries
library("ath1121501cdf")
library("affy")
# specify the path on your computer where the directory that contains the CEL files is located
celpath = "C:/Users/Janick/My Documents/R/win-library/2.14/affydata/celfiles/"
# import CEL files containing raw probe-level data
data = ReadAffy(celfile.path=celpath)
- BrainArray provides a list of custom annotation packages. To use these cdf files, download the zip file from the website. Install it from the local zip file:
Then execute the following code:
# load Bioconductor libraries
library("affy")
# specify the path on your computer where the directory that contains the CEL files is located
celpath = "D:/R-2.15.2/library/affydata/celfiles/Apum/"
# import CEL files containing raw probe-level data
data = ReadAffy(celfile.path=celpath)
# indicate that you want to use the custom cdf
# If you don't specify the cdfName, Bioconductor will use the default Affymetrix cdf.
data@cdfName="ATH1121501_At_TAIRT"
You can find the name of the custom cdf by going to the package folder:
Open the DESCRIPTION file and look in the Description line:
Then proceed by executing the following R-code:
# normalization using the 'rma' algorithm
data.rma = rma(data)
# The output is an ExpressionSet object with a data matrix containing
# normalized log-intensities (at probe set level) in the exprs slot.
# write the probe set-level data to a file called 'data.txt'
write.exprs(data.rma, file="data.txt")
The resulting text file can be imported into the Workbench.
Loading data into the workbench
Open the Workbench, create a New Folder called Microarrays in the Navigation Area and Import the unzipped file into the Microarrays folder. Expand the Microarrays folder and you will see 12 arrays in your Navigation Area.
Double clicking a file will open it in the Main Area. The first column of the microarray files contains probe set IDs like 200007_at, 200011_s_at, 200012_x_at. The second column contains signals.
Always check the signals! They should be small, all below 15: this means that the file contains log-transformed and normalized data. If this is not the case, open the file in a text editor like Notepad and check the headers of the file: they contain a description of the processing steps that were applied to the data. You will find descriptions like "gene expression was quantified by robust multi-array analysis (RMA) using the Genomic Suite software from Partek" or "Robust Multichip Average (RMA) method was used to do background correction, quantile normalization, and expression level summarization".
If you are sure that the data has been normalized and log transformed, you can immediately perform statistical analysis on the data. Otherwise, you can do log transformation in CLC.
In the file that we are using, no description of the data processing is given. From the values we see that log transformation has not been done but we do not know if normalization has been performed or not. Some form of processing must have been performed, otherwise we would have expression values per probe (as in raw Affymetrix data) instead of per probe set (= gene) as is the case here.
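The rule of thumb above can be sketched as a quick sanity check in R (the numbers below are illustrative, not taken from this dataset):

```
# Decide whether expression values already look log transformed.
# Simulated signals stand in for the second column of an imported file.
raw_signal <- c(250.3, 1780.9, 54.2, 12034.7)  # raw-scale intensities
log_signal <- log2(raw_signal)                 # log2-transformed values

looks_logged <- function(x) max(x, na.rm = TRUE) < 15
looks_logged(raw_signal)  # FALSE: values in the thousands are raw
looks_logged(log_signal)  # TRUE: log2 intensities stay well below 15
```

Values in the hundreds or thousands point to raw intensities; log2 intensities from a typical array stay well below 15.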
Setting up a microarray experiment in the Workbench
To analyze differential expression, you need to tell the Workbench how the samples are related. This is done by setting up an Experiment. An Experiment is the central data type when analyzing expression data in the Workbench. It includes a set of samples and information about how the samples are related (which groups they belong to).
How to set up a microarray experiment in the Workbench?
To set up an experiment: Toolbox -> Expression Analysis -> Set Up Experiment
Select the 12 arrays that you have imported
Click Next.
Click Next.
Click Next.
Click Finish.
The table includes the expression values for each sample; in addition, a few extra values have been calculated, such as the range, the IQR (interquartile range), fold change and difference values. Save the experiment.
Log transformation of the data
If the data has not yet been log transformed, as is the case in our example, it is mandatory to do so before you start generating plots or analyzing the data.
If you work with your own data, you will not need this step because you normalized your data with RMA prior to import into the Workbench. The RMA normalization algorithm performs a log transformation on the data. If, for any reason, you used the MAS algorithm for normalization, log transformation will not have been done and you need to do it now.
If you work with public data, normalization should have been performed because the Workbench only accepts SOFT or Series files (= normalized data). Since most people use RMA for normalization the data should be log transformed. However, this is not always the case because of mistakes made during submission, the usage of normalization algorithms that do not perform a log transformation or taking the antilogs after normalization.
How to log transform the data?
To perform a log transformation of the intensities: Toolbox -> Expression Analysis -> Transformation and Normalization -> Transform
Select the 12 arrays. Click Next and choose Log 2 transformation.
Click Finish.
There is also an extra column for the transformed range, IQR, difference and fold change. For each group there is an extra column for the transformed means.
Warning:
Although the Workbench gives you the option, never use a square root transformation on microarray data. Always choose a log transformation.
Quality control of the data
MA plots
In most cases, you start with data that is already normalized in the Workbench so you cannot make the typical before and after normalization MA plots. But even if you can only generate after normalization MA plots you can still check whether the non-linear effect was more or less removed by the normalization (see slides).
How to create MA plots in the Workbench?
To create MA plots: Toolbox -> Expression Analysis -> General Plots -> Create MA Plot
Originally, MA plots were developed for two-color arrays to detect differences between the two color labels, and for these arrays they became hugely popular. This is why people also want to use them for Affymetrix arrays, even though Affymetrix arrays use just a single color label. On single-color arrays, MA plots are made by comparing arrays to each other, either pairwise or against a pseudo-array (e.g. the median of all arrays).
The Workbench does not allow you to create a pseudo-array so your only option is to perform pairwise comparisons. As an example we will select the first diaphragm and the last heart array.
When you click Next you can choose between the original expression values, the transformed expression values (if you have done a log transformation on the data in the Workbench) and the normalized expression values (if you have done normalization on the data in the Workbench). Always use log transformed values for making MA plots.
Click Finish.
The symmetric and even spread of the data points around the M=0 line indicates that the non-linear effect has been removed by the normalization.
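The M and A values behind such a plot are simple to compute yourself. As a sketch, the following R code builds an MA plot for a pairwise comparison of two arrays from simulated log2 intensities:

```
# MA plot for a pairwise comparison of two arrays (simulated log2 data).
set.seed(42)
x <- rnorm(5000, mean = 8, sd = 2)  # log2 intensities on array 1
y <- x + rnorm(5000, sd = 0.3)      # array 2: similar, plus noise
M <- x - y                          # log ratio between the arrays
A <- (x + y) / 2                    # average log intensity
plot(A, M, pch = ".", main = "MA plot")
abline(h = 0, col = "red")          # points should spread evenly around M = 0
```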
Box plot
Again, you will not be able to make the typical before and after normalization box plots. But even if you can only generate after normalization box plots you can still check whether the data distributions of the samples are more or less equalized by the normalization (see slides).
How to create a box plot in the Workbench?
To create a box plot: Toolbox -> Expression Analysis -> Quality Control -> Create Box Plot
Select the experiment and click Next.
You can choose between the original expression values, the transformed expression values (if you have done a log transformation on the data in the Workbench) and the normalized expression values (if you have done normalization of the data in the Workbench). Always use log transformed values for making box plots.
Click Finish.
As expected after normalization, the box plot of our data looks very good because none of the samples stands out from the rest. The different arrays have the same (or at least a very comparable) median expression level. Also the scales of the boxes are almost the same indicating that the spread of the data on the different arrays is comparable.
You do see small differences e.g. on the second diaphragm array the spread of the data is a bit larger than on the other arrays. That is normal since RMA normalization performs an additional step (probe level normalization) after equalizing the data distributions. This step will reintroduce some variation between the data distributions of different samples.
If you see large differences between the boxes, you can equalize the data distributions by performing a normalization in the Workbench (Toolbox -> Transformation and Normalization -> Normalize)
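The same check can be done in plain R. This sketch draws box plots of simulated log2 intensities for six arrays:

```
# Box plots of log2 intensities per array (simulated data).
set.seed(7)
expr <- matrix(rnorm(5000 * 6, mean = 8, sd = 2), ncol = 6,
               dimnames = list(NULL, paste0("array", 1:6)))
boxplot(expr, ylab = "log2 intensity")
# Comparable medians and box heights indicate equalized distributions;
# large differences between boxes call for an extra normalization.
```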
PCA plot
How to create a PCA plot in the Workbench?
To create a PCA plot: Toolbox -> Expression Analysis -> Quality Control -> Principal Component Analysis
Select the experiment and click Next.
PCA performs a transformation of the data onto principal components (PCs). PCs are directions of the largest variation in the data set.
The two PCs are directions (= axes in space) so you can project each sample on these axes. The spread of the projected samples will be the maximum among all possible choices of axes. Projecting the samples on these PCs generates the plot that you see above.
When you have two groups in your data you should see a clear distinction between the two groups as is the case in our example. In the plot, the dots are colored according to the groups. To see which sample corresponds to a dot, place the mouse cursor on the dot for a second, and you will see the name of the sample.
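The same projection can be reproduced in R with prcomp(). Note that prcomp() expects samples as rows, so the expression matrix (genes x samples) is transposed first (simulated data, hypothetical sample names):

```
# PCA on samples: rows of the input to prcomp() must be the samples.
set.seed(3)
expr <- matrix(rnorm(1000 * 6, mean = 8), ncol = 6,
               dimnames = list(NULL, paste0("sample", 1:6)))
expr[1:100, 4:6] <- expr[1:100, 4:6] + 3  # group effect in samples 4-6
group <- factor(rep(c("diaphragm", "heart"), each = 3))
pca <- prcomp(t(expr))                    # transpose: samples as rows
plot(pca$x[, 1], pca$x[, 2], col = as.integer(group), pch = 19,
     xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)
```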
Additional info: clear explanation of PCA
How to display the names of the samples on the PCA plot?
You can also display the name in the plot using the settings in the Side Panel to the right:
In this way you can control the coloring and dot types of the different samples and groups.
To complement the PCA, we will also do a hierarchical clustering of the samples to see if the samples cluster in the groups we expect.
How to cluster the samples?
Toolbox -> Expression Analysis -> Quality Control -> Hierarchical Clustering of Samples
Select the experiment and click Next. Now you can choose the two parameters of the clustering:
1. The distance metric
However, since the Workbench does not allow you to filter out genes with small variation in expression over the different samples, none of the proposed metrics is ideal. Let's take an example:
- Array 3 (green) shows the same expression behaviour as array 1 (blue), but the expression levels on array 3 are lower than on array 1. Since they show the same behavior, Pearson's correlation is 1 and they will cluster together. The Euclidean distance depends on the absolute differences in expression: since the genes on array 3 show moderate and the genes on array 1 very high expression levels, the Euclidean distance between the two arrays is quite large (20), so Euclidean distance will probably not cluster them into the same group.
- On array 2 (red) all genes can be considered off (not expressed), while on array 1 (blue) all genes show different but high expression levels. Pearson's correlation is 1 (perfect correlation), so both arrays will be clustered together. This is of course not what you want. The Euclidean distance is large (43.56), so this metric will keep the arrays apart.
- On array 5 (orange) all genes have high expression levels with very small fluctuations. However, when you calculate Pearson's correlation, these fluctuations are inflated and the correlation with array 1 (blue) is very high (0.98). So these two arrays will be clustered together by Pearson's correlation but not by the Euclidean distance.
- On array 4 (purple) the genes have high expression levels, as on array 1 (blue). However, the genes show different behavior: on array 4, gene B is much higher expressed than genes A and C; on array 1 it is just the opposite. Pearson's correlation will therefore be low and will not cluster them, while the Euclidean distance is very small and will definitely cluster them.
Overall, Euclidean distance is recommended.
2. The method for calculating distances between clusters
Make your choice of parameters, choose to use the log-transformed data (Original expression values or Transformed expression values, depending on your data set) and click Finish.
The clustering generates a heat map showing the cluster tree at the bottom. The two overall groups formed by the clustering are identical to the grouping in the experiment, which is what we want. You can double-check by placing your mouse on the name of the sample - that will show which group it belongs to.
Since both the PCA and the clustering confirm the grouping of the samples, we have no reason to be sceptical about the quality of the samples.
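The contrast between Pearson's correlation and the Euclidean distance can be verified directly in R with toy expression profiles (hypothetical values for three genes, not the exact numbers of the example above):

```
# Toy arrays illustrating Pearson vs. Euclidean behaviour.
a1 <- c(100, 110, 120)  # array 1: high expression, increasing trend
a3 <- c(10, 20, 30)     # array 3: same trend, much lower level
a2 <- c(0.1, 0.2, 0.3)  # array 2: essentially off, same trend

cor(a1, a3)             # 1: Pearson clusters arrays 1 and 3 together
dist(rbind(a1, a3))     # large: Euclidean keeps them apart
cor(a1, a2)             # 1: Pearson wrongly groups the off array with array 1
dist(rbind(a1, a2))     # large: Euclidean correctly keeps them apart
```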
Identification of DE genes
How to identify DE genes?
To carry out a statistical test to identify DE genes:Toolbox -> Expression Analysis -> Statistical Analysis -> On Gaussian Data
Tests based on the Gaussian distribution assume that the data comes from a normal distribution. This means that you can only use these tests on log transformed data. These tests include t-tests and ANOVA. Select the experiment and click Next. Now you can choose the test you want to perform. When you want to compare two groups (like our experiment) you have to use a t-test. When you want to compare three or more groups you have to use an ANOVA first followed by pairwise t-tests.
The choice of t-test depends on the assumption you make about the variances in the groups. By selecting Homogeneous, the test assumes that the groups have equal variances; when In-homogeneous is selected, this assumption is not made. It has been shown that unequal group variances are common in microarray data, so it is recommended to choose In-homogeneous. However, the standard remedy for unequal group variances, the Welch test, is only valid for experiments with a large number of samples. For small numbers of samples a moderated version of the Welch test is recommended. From the CLC documentation it is not clear which remedy is used in the Workbench, so for small numbers of samples you might want to choose Homogeneous. If you want to know for sure whether the group variances in your data are equal, you can use an F-test of equal variance. You cannot do this test in the Workbench, but you can do it in R. If the data are paired, e.g. you use the same individuals before and after treatment, the Use pairing tick box should be active. If ticked, a paired t-test will be calculated; if not, the standard t-test will be used. We will choose In-homogeneous and click Next.
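The F-test of equal variance and both flavours of the t-test are available in base R. A minimal sketch on simulated log2 values for one gene (3 vs. 3 samples, hypothetical numbers):

```
# F-test of equal variances followed by the matching t-test.
heart     <- c(7.9, 8.3, 8.1)  # log2 expression, heart replicates
diaphragm <- c(5.6, 6.9, 6.1)  # log2 expression, diaphragm replicates

var.test(heart, diaphragm)     # F-test: are the group variances equal?
# var.equal = TRUE corresponds to 'Homogeneous' in the Workbench,
# var.equal = FALSE (the Welch test) to 'In-homogeneous'.
t.test(heart, diaphragm, var.equal = FALSE)
```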
Warning:
The standard t-test that is used by the Workbench to find DE genes is known to have poor sensitivity in microarray data analysis (see slides), and a moderated version of the t-test is normally used instead (cf. limma). Therefore, it is possible that the Workbench produces over-optimistic results. Keep this in mind when you use the lists of DE genes for further analysis.
In an experiment with 3 or more groups you normally do an ANOVA first to check whether there is a difference between the groups, followed by pairwise tests that determine which group differs from which group exactly. You can mimic this in the Workbench by first doing an ANOVA and then pairwise t-tests. You can choose to do t-tests for all pairs of groups by clicking the All pairs button, or to have a t-test produced for each group compared to a specific reference group by clicking the Against reference button. In the latter case you must specify which group to use as the reference.
Warning:
Normally you use Tukey's test as a follow-up to ANOVA. Tukey's test is similar but not identical to a t-test and it does correct for making multiple comparisons. Doing pairwise t-tests will not correct for making multiple comparisons!
Main results and specific settings
Only key steps are reproduced here to provide information to interpret the figures. All other steps and parameters will be explained in the tutorial PDF files linked above.
After computing group-wise differential expression, a filtering step is applied to the full table to retain only DE genes with at least a 2-fold change in expression, an adjusted p-value of at most 5x10^-3, and expression data (present calls) in at least 4 of the 6 replicates.
The classical volcano plot is produced, with the 142 DE genes selected during filtering shown in red. This subset will be used as the 'test' set against all other genes in the data table in the hypergeometric enrichment analyses detailed below.
Because the data are log transformed, the 'Difference' column should be used instead of 'Fold Change' to represent the differential expression.
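The filtering step can be mimicked in R on a result table. The column names below are hypothetical, and on log2 data a 2-fold change corresponds to an absolute difference of at least 1:

```
# Filter a (simulated) result table on fold change and adjusted p-value.
set.seed(11)
res <- data.frame(
  gene       = paste0("g", 1:1000),
  difference = rnorm(1000, sd = 1.2),  # log2(heart) - log2(diaphragm)
  p_adjusted = runif(1000)^3           # skewed toward small p-values
)
de <- subset(res, abs(difference) >= 1 & p_adjusted <= 5e-3)
nrow(de)  # number of genes passing both cut-offs
```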
Enrichment analysis in CLC Main
Hypergeometric test
The following figure shows the data samples used in the hypergeometric test.
A window allows selecting the annotation type to be used in the test and the action to take with duplicate probes (here: 'merged by gene symbol to the highest IQR').
The next figure details the top 30 hypergeometric results for the contrast heart vs. diaphragm.
Another page presents details about the parameters used in the different tests:
- results of the hypergeometric test for GO-BP
- results of the hypergeometric test for GO-MF
- results of the hypergeometric test for Pathways
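The hypergeometric test behind these results asks whether a category is over-represented among the DE genes. A sketch with toy counts (only the 142 DE genes is taken from this exercise; the other numbers are made up):

```
# Hypergeometric enrichment p-value for one category (toy counts).
N <- 10000  # genes on the array
K <- 200    # genes annotated to the category
n <- 142    # DE genes selected during filtering
k <- 12     # DE genes that fall in the category

# P(X >= k) when drawing n genes without replacement
p <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
p
```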
GSEA
The GSEA method does not require partitioning the data as the hypergeometric test does; it takes the full table and considers the relative ranking of gene-set members in relation to the individual gene expression levels in the data.
- Settings for the GSEA test for GO-BP
- Settings for the GSEA test for GO-MF
- Settings for the GSEA test for Pathways
Top-10 GSEA results for BP increased in Heart
Top-10 GSEA results for BP increased in Diaphragm
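The ranking idea behind GSEA can be sketched with a minimal, unweighted running score (toy data; real GSEA implementations use a weighted statistic and permutation-based significance):

```
# Minimal unweighted GSEA-style running score on a ranked gene list.
set.seed(1)
stats <- sort(rnorm(100), decreasing = TRUE)         # ranked gene statistics
names(stats) <- paste0("gene", 1:100)
gene_set <- c("gene3", "gene7", "gene15", "gene40")  # hypothetical gene set

in_set  <- names(stats) %in% gene_set
step_up <- 1 / sum(in_set)    # increment at each set member
step_dn <- 1 / sum(!in_set)   # decrement at non-members
running <- cumsum(ifelse(in_set, step_up, -step_dn))
es <- running[which.max(abs(running))]  # enrichment score
es
```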
Additional Info
Tutorials on microarray analysis from CLC:
- Expression_analysis_part_I.pdf
- Expression_analysis_part_II.pdf
- Expression_analysis_part_III.pdf
- Expression_analysis_part_IV.pdf
Published results
The original publication ends with functional enrichment results identifying key differences between 'heart' and 'diaphragm' tissues in rats. We link here to the results published by van Lunteren E, Spiegler S, Moyer M [1]. Full details about this dataset can be found on the GEO page: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6943
Download exercise files
Download exercise files here.
References:
- ↑ Erik van Lunteren, Sarah Spiegler, Michelle Moyer. Contrast between cardiac left ventricle and diaphragm muscle in expression of genes involved in carbohydrate and lipid metabolism. Respir Physiol Neurobiol 2008, 161(1):41-53. [PubMed:18207466] [DOI]