Compute differential analysis using GEO2R within the NCBI web-portal

Analyze public GEO data on the NCBI portal

The GEO portal links to several web tools that allow data analysis without installing anything on your computer. Although these tools cannot compete with sophisticated R/Bioconductor methods, they remain very attractive: they require no prior knowledge of microarray (MA) data analysis and are very fast, leading users to tabular results and pictures that can be fed to other tools or used as is in scientific reports. We proceed here with GEO2R, which finds differentially expressed (DE) genes by comparing sample groups within one GEO submission. Full instructions are available from NCBI^[1] (a tutorial video is also available).

GEO2R step-by-step walk-through for GSE6943

Although the earlier HowTo page Analyze_GEO_data_with_GEO2R is already present on the BITS Wiki, we repeat the analysis here with the same dataset used in the CLC Main Workbench, so that the results of free and commercial solutions can be compared. This work was published by van Lunteren E, Spiegler S, and Moyer M^[2]. Full details about this dataset can be found on its GEO page: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=+GSE6943

Go to the GEO2R website and enter the GEO accession GSE6943 in the search box.

The GEO2R interface

The initial window shows several tabs that will be reviewed in the remainder of this tutorial.

GEO2R sample definition

The first step in the GEO2R analysis is to click Define groups to set up sample groups based on the available samples and to label them. These groups will be used to define contrasts and compute differential expression. Two groups are created, named 'diaphragm' and 'heart', and samples are assigned to a group using the mouse.

Samples are attributed to a group by selecting them: e.g. select all diaphragm samples by holding the Shift key during the selection. Click the diaphragm group to attribute the samples to this group. Do the same for the heart samples.

The order in which you define the groups is important. First define the treated group (it will be colored blue), then define the control group (it will be colored pink). The order determines the sign of the log fold changes calculated later in the analysis. If you reverse the order, genes that are upregulated according to the publication that supports the data will appear downregulated in your results, and vice versa.
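The effect of group order on the sign of the log fold change can be illustrated with a minimal Python sketch. It assumes the log fold change is the difference between the mean log2 expression of the first-defined group and that of the second-defined group; the expression values are hypothetical and not taken from GSE6943.

```python
from statistics import mean

def log_fold_change(first_group, second_group):
    """Difference of mean log2 expression: first group minus second group."""
    return mean(first_group) - mean(second_group)

# Hypothetical log2 expression values for one probe set (not from GSE6943).
diaphragm = [8.0, 8.5, 7.5]
heart = [6.0, 6.5, 5.5]

forward = log_fold_change(diaphragm, heart)  # diaphragm defined first
reverse = log_fold_change(heart, diaphragm)  # heart defined first

print("diaphragm first:", forward)
print("heart first:", reverse)
```

The two values have the same magnitude and opposite signs, which is why a gene reported as upregulated with one group order is reported as downregulated with the other.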

The list of samples in each group can be reviewed by clicking on List in the group definition popup window.

Visualize the distribution of log-transformed expression values

Before proceeding with the DE analysis, it is very important to check that the value distributions are comparable across samples, using the 'Value distribution' tab.

This plot represents the distribution of the data in all samples.

Since the data is supposed to be normalized, you expect comparable boxes for all samples. When the box plots diverge strongly, this may indicate that the data in the Series Matrix file was not yet normalized. Unfortunately, you cannot perform normalization in GEO2R. If the boxes are very different, the samples cannot be compared reliably.
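What "comparable boxes" means can be made concrete with a small Python sketch that computes the five numbers a box plot draws for each sample and the spread between sample medians. The function names, sample names, and values are hypothetical; this is not what GEO2R runs internally, only an illustration of the check performed by eye on the plot.

```python
from statistics import median, quantiles

def box_summary(values):
    """Five-number summary, i.e. the numbers a box plot draws for one sample."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}

def median_spread(samples):
    """Largest difference between sample medians; small values suggest comparable boxes."""
    medians = [median(values) for values in samples.values()]
    return max(medians) - min(medians)

# Hypothetical expression values; sample names and numbers are made up.
samples = {
    "sample_A": [6.0, 7.0, 8.0, 9.0, 10.0],
    "sample_B": [6.5, 7.5, 8.5, 9.5, 10.5],
    "sample_C": [9.0, 10.0, 11.0, 12.0, 13.0],  # shifted: would stand out in the plot
}

for name, values in samples.items():
    print(name, box_summary(values))
print("median spread:", median_spread(samples))
```

In this made-up example the third sample is shifted upward relative to the other two, which is the kind of divergence that suggests missing normalization.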

Search for the top 250 differentially expressed transcripts

Since the boxplots show that the data has been normalized, we can now proceed with finding DE genes (top-250 being a good proxy for downstream analysis) between the two groups.

Options can be set in the Options tab to handle log transformation and multiple testing correction to be applied to the data.

The default options are the best choice for most data sets. A false discovery rate (FDR) method is used to correct for multiple testing. GEO2R checks by itself whether the data is log-transformed and performs a log transformation if necessary. Information from NCBI is used to link the probe sets (genes) to annotation such as GO descriptions and chromosome location.
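How an automatic check for log transformation might work can be sketched in Python as follows. The quantile-based rule and its thresholds (99th percentile above 100, or a range above 50 with a positive first quartile) are assumptions for illustration; the rule GEO2R actually applies can be read in the R script tab. The function names are hypothetical.

```python
import math

def quantile(sorted_values, p):
    """Linear-interpolation quantile of an already sorted list (0 <= p <= 1)."""
    position = p * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

def looks_unlogged(values):
    """Heuristic (assumed thresholds): very large values suggest raw intensities."""
    ordered = sorted(values)
    q0, q25, q99, q100 = (quantile(ordered, p) for p in (0.0, 0.25, 0.99, 1.0))
    return q99 > 100 or (q100 - q0 > 50 and q25 > 0)

def log2_if_needed(values):
    """Return log2 values when the data looks unlogged, else an unchanged copy."""
    if not looks_unlogged(values):
        return list(values)
    # Non-positive values have no logarithm and are marked as missing.
    return [math.log2(v) if v > 0 else float("nan") for v in values]

raw = [64.0, 256.0, 1024.0, 4096.0]  # hypothetical raw intensities
logged = [6.0, 8.0, 10.0, 12.0]      # hypothetical log2 values

print(looks_unlogged(raw), log2_if_needed(raw))
print(looks_unlogged(logged), log2_if_needed(logged))
```

A heuristic like this can misjudge unusual data sets, which is one reason the Options tab also allows the log transformation to be set manually.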

When satisfied, go to the GEO2R tab and click the Top250 button to run a limma analysis for identifying DE genes.

When more than two groups are defined, GEO2R selects pairwise contrasts in a triangular/circular way (depending on the number of groups). These contrasts are labelled with arbitrary names (G0, G1, ... Gn) and do not always reflect what the user expects. Unfortunately there is little that can be done in GEO2R to control this choice, but more can be done when post-processing the code in RStudio, as will be shown in the dedicated tutorial.
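To get a feeling for how many pairwise contrasts exist for a given number of groups, the following Python sketch enumerates all of them for groups labelled G0, G1, G2. It does not reproduce the selection GEO2R makes; the function name and the "A-B" naming of contrasts are assumptions for illustration only.

```python
from itertools import combinations

def pairwise_contrasts(groups):
    """All unordered pairs of group labels, written as 'first-second'."""
    return [f"{a}-{b}" for a, b in combinations(groups, 2)]

contrasts = pairwise_contrasts(["G0", "G1", "G2"])
print(contrasts)
print(len(contrasts), "contrasts for 3 groups")
```

With n groups there are n(n-1)/2 possible pairs, so it is worth checking which of them a tool actually reports before interpreting the results.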

The limma analysis results in a list of the 250 transcripts with the lowest p-values (ranked by increasing p-value).
The results table contains the following columns:

adj.P.Val: p-value after correction for multiple testing.
This column is the statistic you should use for interpreting the results. Genes with the smallest adjusted p-values are the most reliable. Selecting all probe sets with adjusted p-values < 0.05 is equivalent to setting the False Discovery Rate (FDR) to 0.05, meaning that about 5% of the selected DE genes are expected to be false positives.
GEO2R always shows the 250 genes with the lowest p-values, regardless of the significance of their p-values. Sometimes 250 is not enough and you miss DE genes (as is the case in this example), sometimes 250 is too much and only a fraction of these 250 genes is really DE. So always check the adjusted p-values to decide how many genes of these 250 you are going to use for further analysis.
P.Value: raw p-value before multiple testing correction
t: moderated t-statistic (from the shrunken t-test used by limma)
B: B-statistic or log-odds that the gene is differentially expressed
logFC: Log2-fold change between the two experimental conditions
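The relation between P.Value and adj.P.Val can be illustrated with a short Python sketch of the Benjamini-Hochberg procedure, a widely used FDR method. That the adjustment in your analysis uses exactly this method is an assumption here (check the Options tab); the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical raw p-values for four probe sets.
raw_p = [0.01, 0.04, 0.03, 0.5]
adj_p = benjamini_hochberg(raw_p)

# Keep only probe sets whose adjusted p-value is below the chosen FDR.
significant = [i for i, p in enumerate(adj_p) if p < 0.05]
print(adj_p)
print("significant indices:", significant)
```

In this made-up example three raw p-values are below 0.05 but only one adjusted p-value is, which shows why the adjusted column, not the raw one, should decide how many of the 250 genes to keep.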

This table contains links through which detailed expression information can be retrieved for interesting genes (not further detailed here).

Clicking on 'Save all results' opens a new window with the full table, which can be saved to disk as a tab-separated text file using the browser's File > Save option.

If you wish to upload this table to Ingenuity Pathway Analysis (IPA), consider opening it first in Microsoft Excel and saving it as a '.xls' file. This removes the double quotes around fields and allows IPA to recognize your data better.
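As an alternative to the round trip through Excel, the double quotes can also be removed with a few lines of Python. The sketch below assumes a tab-separated table with quoted fields and uses a made-up two-row excerpt (the probe identifiers and numbers are hypothetical; the column names are the ones listed above). Whether IPA accepts the resulting text file directly is not verified here.

```python
import csv
import io

# Hypothetical excerpt in the layout of a saved, tab-separated results table.
saved_table = (
    '"ID"\t"adj.P.Val"\t"P.Value"\t"t"\t"B"\t"logFC"\n'
    '"probe_1"\t"0.001"\t"0.00001"\t"12.5"\t"4.2"\t"2.1"\n'
    '"probe_2"\t"0.2"\t"0.01"\t"-3.1"\t"-2.0"\t"-0.8"\n'
)

def strip_quotes(text):
    """Re-write a quoted tab-separated table without quotes around fields."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar='"')
    output = io.StringIO()
    # QUOTE_NONE raises csv.Error if a field itself contains a tab or a quote;
    # that is acceptable for this sketch, where fields are simple values.
    writer = csv.writer(output, delimiter="\t", quoting=csv.QUOTE_NONE,
                        lineterminator="\n")
    writer.writerows(reader)
    return output.getvalue()

cleaned = strip_quotes(saved_table)
print(cleaned)
```

To process a real file, the same function can be applied to the text read from the saved file, and the result written to a new file.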

Saving the Rscript for further use in RStudio

This is the last step of this tutorial and the first step of the follow-up page PubMA_Exercise.2b, where we will produce an R script to perform the GEO2R analysis on our own computer and prepare for more advanced microarray analyses.
When you click the R script tab, you see the R code that GEO2R uses to analyze the data and to generate the box plot and the list of DE genes.

Download exercise files

Download exercise files here.

Use the appropriate application to open the files in ex2-files.

References:

[1] https://www.ncbi.nlm.nih.gov/geo/info/geo2r.html

[2] Erik van Lunteren, Sarah Spiegler, Michelle Moyer. Contrast between cardiac left ventricle and diaphragm muscle in expression of genes involved in carbohydrate and lipid metabolism. Respir Physiol Neurobiol: 2008, 161(1);41-53. [PubMed:18207466]

GEO page: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=+GSE6943


PubMA Exercise.2
