PubMA Exercise.2
Compute differential analysis using GEO2R within the NCBI web-portal
Analyze public GEO data on the NCBI portal
The GEO portal links to several web tools that allow data analysis without installing anything on your computer. Although these tools cannot compete with sophisticated R/Bioconductor methods, they remain attractive: they require no prior knowledge of microarray (MA) data analysis and quickly produce tabular results and pictures that can be fed to other tools or used as is in scientific reports. Here we use GEO2R, which finds differentially expressed genes by comparing sample groups within one GEO submission. Full instructions are available [1], as is a tutorial video.
GEO2R step-by-step walk-through for GSE6943
Although an earlier HowTo page, Analyze_GEO_data_with_GEO2R, is already present on the BITS Wiki, we repeat the analysis here with the same dataset used in the CLC Main Workbench exercise, so that the results of free and commercial solutions can be compared. This work was published by van Lunteren E, Spiegler S, and Moyer M [2]. Full details about this dataset can be found on the GEO page: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=+GSE6943
Go to the GEO2R website and enter the GEO accession GSE6943 in the search box.
The GEO2R interface
The initial window shows several tabs that will be reviewed in the remainder of this tutorial.
GEO2R sample definition
The first step in the GEO2R analysis is to click Define groups and set up labelled sample groups from the available samples. These groups will be used to define contrasts and compute differential expression. Here two groups are created, named 'diaphragm' and 'heart'.
Samples are assigned to a group by selecting them with the mouse: for example, select all diaphragm samples by holding the Shift key during the selection, then click the diaphragm group to assign the samples to it. Do the same for the heart samples.
The order in which you define the groups matters because it determines the sign of the log fold changes computed later in the analysis. First define the treated group (it will be colored blue), then the control group (it will be colored pink). If you reverse the order, genes that are upregulated according to the publication that supports the data will appear downregulated in your results, and vice versa.
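The effect of the group order on the sign can be illustrated with a minimal Python sketch. It assumes that the log fold change is the mean log2 expression of the first-defined group minus that of the second-defined group; the function name and the expression values are hypothetical and are not taken from GSE6943.

```python
from statistics import mean


def log_fold_change(first_group, second_group):
    """Mean log2 expression of the first group minus that of the second group."""
    return mean(first_group) - mean(second_group)


# Hypothetical log2 expression values for a single probe set.
diaphragm = [8.0, 8.5, 7.5]
heart = [6.0, 6.5, 5.5]

lfc_forward = log_fold_change(diaphragm, heart)   # diaphragm defined first
lfc_reversed = log_fold_change(heart, diaphragm)  # heart defined first

print(lfc_forward, lfc_reversed)
```

Swapping the two groups leaves the magnitude unchanged and only flips the sign, which is why the same gene can look upregulated or downregulated depending on the order.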
The list of samples in each group can be reviewed by clicking on List in the group definition popup window.
Visualize the distribution of log-transformed expression values
Before proceeding with the differential expression (DE) analysis, it is very important to first check in the 'Value distribution' tab that the value distributions are homogeneous across samples.
This plot represents the distribution of the data in all samples.
Since the data is supposed to be normalized, you expect comparable boxes for all samples. When the box plots diverge strongly, the data in the Series Matrix file may not have been normalized yet. Unfortunately, you cannot perform normalization in GEO2R. If the boxes are very different, it is not possible to compare the samples.
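What "comparable boxes" means can be made concrete with a small Python sketch that computes the box-plot statistics per sample and flags samples whose median lies far from the others. The sample names, the values, and the tolerance of 1 log2 unit are arbitrary assumptions for illustration, not a GEO2R rule.

```python
from statistics import median, quantiles


def box_stats(values):
    """Five-number summary used to draw one box of a box plot."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}


# Hypothetical log2 expression values for three samples.
samples = {
    "sample_A": [6.1, 7.0, 7.9, 8.4, 9.2, 10.5, 11.8],
    "sample_B": [6.0, 7.1, 8.0, 8.5, 9.1, 10.4, 11.9],
    "sample_C": [9.1, 10.0, 10.9, 11.4, 12.2, 13.5, 14.8],  # shifted upward
}

medians = {name: box_stats(values)["median"] for name, values in samples.items()}
overall_median = median(medians.values())

TOLERANCE = 1.0  # arbitrary threshold in log2 units (assumption)
divergent = [name for name, m in medians.items() if abs(m - overall_median) > TOLERANCE]

print(divergent)
```

A sample flagged this way corresponds to a box that sits visibly higher or lower than the rest in the 'Value distribution' plot.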
Search for the top 250 differentially expressed transcripts
Since the boxplots show that the data has been normalized, we can now proceed with finding DE genes (top-250 being a good proxy for downstream analysis) between the two groups.
Options can be set in the Options tab to handle log transformation and multiple testing correction to be applied to the data.
The default options are the best choice for most data sets. An FDR correction is applied for multiple testing. GEO2R checks by itself whether the data is log transformed and performs a log transformation if necessary. Information from NCBI is used to link the probe sets (genes) to annotation such as GO descriptions and chromosome location.
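The automatic check for log-transformed data can be pictured with the following Python sketch. The thresholds are loosely modeled on the quantile test found in GEO2R-generated R scripts, but treat them as assumptions and compare them with the code shown in the R script tab; the function names and the intensity values are made up for illustration.

```python
import math


def percentile(sorted_values, p):
    """Linear-interpolation percentile (p between 0 and 1) of a sorted list."""
    position = p * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def looks_unlogged(values):
    """Guess whether values are raw intensities (assumed thresholds)."""
    ordered = sorted(values)
    q25 = percentile(ordered, 0.25)
    q99 = percentile(ordered, 0.99)
    value_range = ordered[-1] - ordered[0]
    return q99 > 100 or (value_range > 50 and q25 > 0)


def log2_if_needed(values):
    """Apply log2 only when the data does not look log transformed already."""
    if not looks_unlogged(values):
        return list(values)
    # Non-positive values cannot be log transformed and become NaN.
    return [math.log2(v) if v > 0 else float("nan") for v in values]


raw_intensities = [120.0, 850.0, 64.0, 2048.0, 512.0]  # hypothetical raw values
already_logged = [6.9, 9.7, 6.0, 11.0, 9.0]            # hypothetical log2 values

print(looks_unlogged(raw_intensities), looks_unlogged(already_logged))
```

The idea is that log2 values from a microarray rarely exceed a few tens, so very large values or a very wide range suggest raw intensities.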
When satisfied, go to the GEO2R tab and click the Top250 button to run a limma analysis for identifying DE genes.
When more than two groups are defined, GEO2R selects pairwise contrasts in a triangular or circular way, depending on the number of groups. These contrasts are labelled with arbitrary names (G0, G1, ... Gn) and do not always reflect what the user expects. Unfortunately, there is little that can be done in GEO2R to control this choice, but more can be done when post-processing the code in RStudio, as will be shown in the dedicated tutorial.
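To see why the choice of contrasts matters, the Python sketch below simply lists every pairwise contrast that exists for a set of groups, with explicit names instead of G0, G1, and so on. It does not reproduce how GEO2R picks its contrasts; the third group, 'soleus', is hypothetical and is not part of this exercise.

```python
from itertools import combinations


def pairwise_contrasts(groups):
    """Name every pairwise contrast as 'first-second', in definition order."""
    return [f"{first}-{second}" for first, second in combinations(groups, 2)]


# 'soleus' is a made-up third group added only to illustrate the idea.
groups = ["diaphragm", "heart", "soleus"]
contrasts = pairwise_contrasts(groups)

print(contrasts)
```

With n groups there are n(n-1)/2 such pairs, so writing the contrasts out explicitly quickly becomes worthwhile once you move to a script.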
The limma analysis results in a list of the 250 transcripts with the lowest p-values (ranked by increasing p-value).
The results table contains the following columns:
- adj.P.Val: p-value after correction for multiple testing.
This column is the statistic you should use for interpreting the results. Genes with the smallest adjusted p-values are the most reliable. Selecting all probe sets with adjusted p-values < 0.05 is equivalent to setting the False Discovery Rate (FDR) to 0.05, meaning that about 5% of the selected DE genes are expected to be false positives.
GEO2R always shows the 250 genes with the lowest p-values, regardless of the significance of their p-values. Sometimes 250 is not enough and you miss DE genes (as is the case in this example); sometimes 250 is too many and only a fraction of these 250 genes is really DE. So always check the adjusted p-values to decide how many of these 250 genes you are going to use for further analysis.
- P.Value: raw p-value before multiple testing correction
- t: moderated t-statistic (t-test with shrunken variances)
- B: B-statistic or log-odds that the gene is differentially expressed
- logFC: log2 fold change between the two experimental conditions
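The relation between the P.Value and adj.P.Val columns can be sketched in Python with a Benjamini-Hochberg adjustment, a common FDR procedure. This assumes that the FDR correction in use is Benjamini-Hochberg; the raw p-values are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for five probe sets.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
adj_p = benjamini_hochberg(raw_p)

n_raw_below = sum(p < 0.05 for p in raw_p)
n_adj_below = sum(p < 0.05 for p in adj_p)

print(n_raw_below, n_adj_below)
```

In this toy example four raw p-values fall below 0.05 but only two adjusted p-values do, which is why the cut-off should be applied to adj.P.Val rather than to P.Value.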
This table contains links through which detailed expression information can be retrieved for interesting genes (not further detailed here).
Clicking on 'Save all results' opens a new window with the full table, which can be saved to disk as a tab-separated text file using the browser's File Save option.
If you wish to upload this table to Ingenuity Pathway Analysis (IPA), consider opening it first in Microsoft Excel and saving it as a '.xls' file. This removes the double quotes around fields and allows IPA to recognize your data better.
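If you prefer not to go through Excel, the double quotes can also be removed with a few lines of Python. This is a minimal sketch that works on text held in memory; the column named ID and the row content are hypothetical, and whether IPA accepts the resulting tab-separated file as is has not been verified here.

```python
import csv
import io


def strip_field_quotes(quoted_tsv):
    """Re-write tab-separated text without quotes around plain fields."""
    reader = csv.reader(io.StringIO(quoted_tsv), delimiter="\t", quotechar='"')
    output = io.StringIO()
    # QUOTE_MINIMAL only quotes fields that contain a tab, quote, or newline.
    writer = csv.writer(
        output, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerows(reader)
    return output.getvalue()


# Hypothetical two-line excerpt of a saved results table.
saved_table = '"ID"\t"adj.P.Val"\t"logFC"\n"probe_1"\t"0.005"\t"2.0"\n'
cleaned_table = strip_field_quotes(saved_table)

print(cleaned_table)
```

To process a real file, read its content into a string first and write the returned text to a new file.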
Saving the R script for further use in RStudio
This is the last step of this tutorial and the first step of the follow-up page PubMA_Exercise.2b where we will produce an R script to perform the GEO2R analysis on our own computer and prepare for more advanced microarray analyses.
When you click the R script tab you see the R-code that GEO2R uses to analyze the data and to generate the box plot and the list of DE genes.
download exercise files
Download exercise files here.
References:
- [1] https://www.ncbi.nlm.nih.gov/geo/info/geo2r.html
- [2] Erik van Lunteren, Sarah Spiegler, Michelle Moyer. Contrast between cardiac left ventricle and diaphragm muscle in expression of genes involved in carbohydrate and lipid metabolism. Respir Physiol Neurobiol. 2008;161(1):41-53. [PubMed:18207466]