Clustering using the GEO Dataset browser (only for data with attached GDS ID)

Introduction

The GEO Dataset Browser ^[1] takes care of providing users clustered data and heatmaps derived from manual curation of some of the GEO datasets selected by the NCBI team. This processing is otherwise laborious and requires [R] skills that not every scientist can build. More info about this toolset can be found on the NCBI help page^[2]

The web interface allows finding co-expressed genes in clusters. Co-expressed genes often belong to common signaling pathways or result from transcriptional co-regulation by common transcription factors.

The GEO Dataset browser

The remaining of this page shows key view obtained by navigating in the GEO Dataset browser Web interface

Start the tool

Start by searching NCBI's Bioproject or GEO Datasets database using the ID GSE6943. On the resulting Bioproject page you can follow the link to the GEO data sets that correspond to this microarray experiment.

Select GEO Dataset Browser data

The above link brings you to the GEO DataSets page shown next. The first reference links to GEO2R to identify DE genes while the second refers to the GEO Dataset Browser for clustering and creation of a heat map.

The Analyze DataSet link opens a new tab

You can find a full description of GEO DataSet records on NCBI's About GEO Data Sets page.

Download the full data table

The Download section in the right menu allows you to download the full normalized data set (Series Matrix file) in various formats

If you do the download, the resulting text file contains a 49 lines header which may pose problems in Excel

See how to split the file in two using some CLI magic

# download the file from the 'DataSet SOFT file' link in the main window
wget ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS3nnn/GDS3224/soft/GDS3224.soft.gz
 
# decompress and replace line ends bby valid unix CR
gunzip -c GDS3224.soft.gz | tr "\r" "\n" > GDS3224.soft.txt
 
# find table line#1
grep -n 'dataset_table_begin' GDS3224.soft.txt
49:!dataset_table_begin
 
# we need only the number 49!
header_end=$(grep -n 'dataset_table_begin' GDS3224.soft.txt | awk 'BEGIN{FS=":"}{print $1}')
 
# show the result
echo ${header_end}
49
 
# split the file in two
head -${header_end} GDS3224.soft.txt > GDS3224.data-header.txt
cat GDS3224.soft.txt | sed -e "1,${header_end}d" > GDS3224.data.txt
 
# inspect results
grep ^#GSM GDS3224.data-header.txt
#GSM160089 = Value for GSM160089: Diaphragm 1; src: Diaphragm
#GSM160090 = Value for GSM160090: Diaphragm 2; src: Diaphragm
#GSM160091 = Value for GSM160091: Diaphragm 3; src: Diaphragm
#GSM160092 = Value for GSM160092: Diaphragm 4; src: Diaphragm
#GSM160093 = Value for GSM160093: Diaphragm 5; src: Diaphragm
#GSM160094 = Value for GSM160094: Diaphragm 6; src: Diaphragm
#GSM160095 = Value for GSM160095: Heart 1; src: Heart (left vent)
#GSM160096 = Value for GSM160096: Heart 2; src: Heart (left vent)
#GSM160097 = Value for GSM160097: Heart 3; src: Heart (left vent)
#GSM160098 = Value for GSM160098: Heart 4; src: Heart (left vent)
#GSM160099 = Value for GSM160099: Heart 5; src: Heart (left vent)
#GSM160100 = Value for GSM160100: Heart 6; src: Heart (left vent)
 
head GDS3224.data.txt | column -t
ID_REF      IDENTIFIER  GSM160089  GSM160090  GSM160091  GSM160092  GSM160093  GSM160094  GSM160095  GSM160096  GSM160097  GSM160098  GSM160099  GSM160100
1367452_at  Sumo2       2532.9     2518.6     2384.6     2304       2360       2482.8     3166       2938.9     2953.3     2558.8     3043.3     2711.5
1367453_at  Cdc37       3464.2     3197.4     3487.1     3133.2     3432.5     3486.9     3860.2     3429.4     3381.2     4131.3     4364.2     3605.2
1367454_at  Copb2       1620.8     1870.5     1538.6     1334       1502.9     1520.3     1849.8     1852       1858.5     1483.3     1766.1     1500.5
1367455_at  Vcp         5512.5     4103.9     5746.5     4393.6     5870.7     5851.2     5408.8     4682.6     4734.7     4403.7     3940       3674.5
1367456_at  Ube2d3      6090.8     5352.2     5614.9     5249.6     5834.6     5915.9     3995.3     4356.9     4675.1     4570.4     4994.8     4231.6
1367457_at  Becn1       1093.9     1134.3     736.4      1219       774.9      712.2      892.1      998.9      782.3      710.8      923.2      975
1367458_at  Lypla2      347.8      223.9      261.4      338.8      249.6      363.7      422.2      409.9      273.3      492.2      458        403.4
1367459_at  Arf1        7665.8     7415.9     7075.9     7349.4     6406.7     6664.6     10400.5    9729.2     9679.2     9996.8     9783.7     8333
1367460_at  Gdi2        3155.7     2946.9     3589.7     3487.4     3131.5     3338       3198       2991.3     2781       2754.1     2406.9     2609.5

Walking through the tools

Several built-in tools are ready to serve you on demand. We provide below a rapid overview of the results obtained by each tool.

Find Genes

Allows users interested in a particular gene to get its expression values across samples.

Compare two sets of samples

This allows selecting two groups of samples that are compared using a t-test.

You can choose between:

a two-tailed comparison: you check if the two groups have different expression
a one-tailed comparison: you check if expression in group 1 is higher than in group 2
a one-tailed comparison: you check if expression in group 1 is smaller than in group 2
value means difference test: you check if the mean fold change is exceeding a threshold, e.g. more than two-fold difference between the groups
rank means difference test: test based on ranks of fold changes, you test if the mean rank difference is exceeding a threshold, e.g. more than two-fold

Remember that ordinary t-tests are not well-suited for microarray data because the assumptions of the t-test are not met (equal variances between the groups), microarray experiments are very noisy and there are too few biological replicates to make accurate estimates of means and variations.

Ideally, you use a modified t-test (as in limma) to compare expression levels. In this tool of the GEO browser, they use a regular t-test without multiple testing correction (as far as I can conclude from the very concise documentation). You can set the significance level and not a FDR. This means that if you set the significance level to the lowest value (0,01) you allow 1% false positives. The rat array represents around 10000 genes so you'll do 10000 t-tests. With an alpha level of 0,01 this will generate 100 false positives.

Although the t-test in this tool is far from ideal, the two other tests also have their limitations, mainly the fact that they rely on arbitrary thresholds (3-fold? 2-fold?).

What about the choice between a two- or a one-tailed t-test? Although the one-tailed test is tempting since it is less stringent than the two-tailed test (the threshold for calling a t-statistic significantly different from 0 is lower in a one-sided test), you are only allowed to use a one-tailed test if you know upfront (before you do the experiment) that you are only interested in one-sided differences. For instance, if you make a knock-out, you may assume that the gene will be downregulated in the mutant or not changing expression (if the mutant is not ok) but you do not expect to find overexpression of the gene you knocked out. In this case you could use a one-tailed test to check if the gene is properly knocked out.

Two tails comparison

Select the two-tailed test, with the lowest significance level (0,01). The groups are defined using the mouse.

The comparison is based on the selected test method. The choice of two tails allows to find DE genes, either up- or downregulated.

The result is a very long list of DE genes with barplot view on the right confirming the difference in expression between the two groups. However, in this format the list is not really useful

Luckily, the right menus allow us to do more usefull things with the selected genes

Filters and Find related data are not detailed here and left to your curiosity

1. Profile data

This lets you download differentially expressed data selected by the tool

2. Profile pathways

Search pathways enriched in the list of DE genes

Column legends

Rank mean difference "4x & either"

The two-tailed t-test found too many DE genes to be specific for given pathways. This is because of the experimental setup, comparing two completely different tissues (heart and diaphragm). In most microarray experiments, the groups that you compare are not that different and you will find less DE genes. The only way we can reduce this number is by doing a test that evaluates fold changes and selecting a high threshold, e.g. the rank mean difference test with a 4-fold difference threshold. Now, we get only the 152 genes with the most extreme differences in expression between heart and diaphragm (at least a 4-fold difference).

Search pathways enriched in the obtained subset

The pathways are now very specific to heart biology as expected

Cluster heatmaps

Two options are presented in this part, the hierarchical clustering and the K-Mean approaches.

Hierarchical clustering

Hierarchical clustering and heatmaps are classical representations of differential expression that nicely provide high level view on the data.

You may choose between Euclidean distance and Pearson correlation as measures for the distance between two genes (expression profiles). You must realize that the data are not standardized by this tool so both distance measures will measure different things:

Euclidean distance will sum up absolute differences between expression levels: it groups genes that have similar expression VALUES
Pearson correlation will measure the similarity of the expression profiles: it groups genes that have similar expression BEHAVIOUR, up in the same samples and down in the same samples

So the two distance measures will result in different trees, as you can see in the figure below showing the immediate neighbours of Creg1 (represented twice on the array) in the resulting tree.

You see that Euclidean distance groups genes with very similar expression values (to the same extent high or low), while Pearson correlation focusses on trends (high in the same samples regardless of how high exactly) and that they result in different groupings.

So it's important to make the right choice of distance measure for what you want to know !

You see that you also have to choose how to define the distance between clusters. This choice is not as important as the choice between Euclidean distance and Pearson correlation. The difference between the three proposed methods is:

average: you measure the distances between cluster centers, the standard choice
complete: you measure the distance between the farthest members of each cluster
single: you measure the distance between the closest members of each cluster

A number of interactive links allow changing the correlation method, changing colors, and/or selecting heatmap regions to plot other graphs or export data to file

K-Means clustering

.

If you run K-means repeatedly, it will always return different clusters. So you will not get the same results as the ones below!!! BUT you will find back mostly the same genes in similar clusters

The user decides arbitrarily the number of clusters to be found. Too few will lead to mixed profiles while too many will lead to apparent redundancy.

From the obtained solution, it is clear that the four clusters are grouping genes having clear trends in both sample groups. Groups #1 and #3 being highly contrasted between Diaphragm and Heart while #2 and #4 are not/more subtle DE groups

Clicking on the first group to get more details for genes higher in the Diaphragm

Similarly, for genes higher in the Heart

Finally, zooming on a small number of genes allows plotting their profiles and exporting the results to file

Experiment design and value distribution

This QC tool plots box-plot for each sample and allows evaluating the global quality of the dataset. A good dataset has a constant median line and similar distributions.

download exercise files

Download exercise files here.

Use the right application to open the files present in ex3-files

References:

[1] ttp://www.ncbi.nlm.nih.gov/sites/GDSbrowser/

[2] ttp://www.ncbi.nlm.nih.gov/geo/info/datasets.html

[1]

[2]

PubMA Exercise.3

Contents

Introduction

The GEO Dataset browser

Start the tool

Select GEO Dataset Browser data

Download the full data table

Walking through the tools

Find Genes

Compare two sets of samples

Two tails comparison

1. Profile data

2. Profile pathways

Rank mean difference "4x & either"

Search pathways enriched in the obtained subset

Cluster heatmaps

Hierarchical clustering

K-Means clustering

Experiment design and value distribution

download exercise files

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox