PubMA Exercise.3
Clustering using the GEO Dataset browser (only for data with attached GDS ID)
[ Main_Page | Hands-on Analysis of public microarray datasets | PubMA_Exercise.2 | PubMA_Exercise.3 |
| PubMA_Exercise.4 ]
Contents
Introduction
The GEO Dataset Browser [1] takes care of providing users clustered data and heatmaps derived from manual curation of some of the GEO datasets selected by the NCBI team. This processing is otherwise laborious and requires [R] skills that not every scientist can build. More info about this toolset can be found on the NCBI help page[2]
The web interface allows finding co-expressed genes in clusters. Co-expressed genes often belong to common signaling pathways or result from transcriptional co-regulation by common transcription factors.
The GEO Dataset browser
The remaining of this page shows key view obtained by navigating in the GEO Dataset browser Web interface
Start the tool
Start by searching NCBI's Bioproject or GEO Datasets database using the ID GSE6943. On the resulting Bioproject page you can follow the link to the GEO data sets that correspond to this microarray experiment.
Select GEO Dataset Browser data
The above link brings you to the GEO DataSets page shown next. The first reference links to GEO2R to identify DE genes while the second refers to the GEO Dataset Browser for clustering and creation of a heat map.
The Analyze DataSet link opens a new tab
You can find a full description of GEO DataSet records on NCBI's About GEO Data Sets page.
Download the full data table
The Download section in the right menu allows you to download the full normalized data set (Series Matrix file) in various formats
If you do the download, the resulting text file contains a 49 lines header which may pose problems in Excel
See how to split the file in two using some CLI magic
# download the file from the 'DataSet SOFT file' link in the main window wget ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS3nnn/GDS3224/soft/GDS3224.soft.gz # decompress and replace line ends bby valid unix CR gunzip -c GDS3224.soft.gz | tr "\r" "\n" > GDS3224.soft.txt # find table line#1 grep -n 'dataset_table_begin' GDS3224.soft.txt 49:!dataset_table_begin # we need only the number 49! header_end=$(grep -n 'dataset_table_begin' GDS3224.soft.txt | awk 'BEGIN{FS=":"}{print $1}') # show the result echo ${header_end} 49 # split the file in two head -${header_end} GDS3224.soft.txt > GDS3224.data-header.txt cat GDS3224.soft.txt | sed -e "1,${header_end}d" > GDS3224.data.txt # inspect results grep ^#GSM GDS3224.data-header.txt #GSM160089 = Value for GSM160089: Diaphragm 1; src: Diaphragm #GSM160090 = Value for GSM160090: Diaphragm 2; src: Diaphragm #GSM160091 = Value for GSM160091: Diaphragm 3; src: Diaphragm #GSM160092 = Value for GSM160092: Diaphragm 4; src: Diaphragm #GSM160093 = Value for GSM160093: Diaphragm 5; src: Diaphragm #GSM160094 = Value for GSM160094: Diaphragm 6; src: Diaphragm #GSM160095 = Value for GSM160095: Heart 1; src: Heart (left vent) #GSM160096 = Value for GSM160096: Heart 2; src: Heart (left vent) #GSM160097 = Value for GSM160097: Heart 3; src: Heart (left vent) #GSM160098 = Value for GSM160098: Heart 4; src: Heart (left vent) #GSM160099 = Value for GSM160099: Heart 5; src: Heart (left vent) #GSM160100 = Value for GSM160100: Heart 6; src: Heart (left vent) head GDS3224.data.txt | column -t ID_REF IDENTIFIER GSM160089 GSM160090 GSM160091 GSM160092 GSM160093 GSM160094 GSM160095 GSM160096 GSM160097 GSM160098 GSM160099 GSM160100 1367452_at Sumo2 2532.9 2518.6 2384.6 2304 2360 2482.8 3166 2938.9 2953.3 2558.8 3043.3 2711.5 1367453_at Cdc37 3464.2 3197.4 3487.1 3133.2 3432.5 3486.9 3860.2 3429.4 3381.2 4131.3 4364.2 3605.2 1367454_at Copb2 1620.8 1870.5 1538.6 1334 1502.9 1520.3 1849.8 1852 1858.5 1483.3 1766.1 1500.5 1367455_at Vcp 5512.5 4103.9 5746.5 4393.6 5870.7 5851.2 5408.8 4682.6 4734.7 4403.7 3940 3674.5 1367456_at Ube2d3 6090.8 5352.2 5614.9 5249.6 5834.6 5915.9 3995.3 4356.9 4675.1 4570.4 4994.8 4231.6 1367457_at Becn1 1093.9 1134.3 736.4 1219 774.9 712.2 892.1 998.9 782.3 710.8 923.2 975 1367458_at Lypla2 347.8 223.9 261.4 338.8 249.6 363.7 422.2 409.9 273.3 492.2 458 403.4 1367459_at Arf1 7665.8 7415.9 7075.9 7349.4 6406.7 6664.6 10400.5 9729.2 9679.2 9996.8 9783.7 8333 1367460_at Gdi2 3155.7 2946.9 3589.7 3487.4 3131.5 3338 3198 2991.3 2781 2754.1 2406.9 2609.5
Walking through the tools
Several built-in tools are ready to serve you on demand. We provide below a rapid overview of the results obtained by each tool.
Find Genes
Allows users interested in a particular gene to get its expression values across samples.
Compare two sets of samples
This allows selecting two groups of samples that are compared using a t-test.
You can choose between:
- a two-tailed comparison: you check if the two groups have different expression
- a one-tailed comparison: you check if expression in group 1 is higher than in group 2
- a one-tailed comparison: you check if expression in group 1 is smaller than in group 2
- value means difference test: you check if the mean fold change is exceeding a threshold, e.g. more than two-fold difference between the groups
- rank means difference test: test based on ranks of fold changes, you test if the mean rank difference is exceeding a threshold, e.g. more than two-fold
Remember that ordinary t-tests are not well-suited for microarray data because the assumptions of the t-test are not met (equal variances between the groups), microarray experiments are very noisy and there are too few biological replicates to make accurate estimates of means and variations.
Ideally, you use a modified t-test (as in limma) to compare expression levels. In this tool of the GEO browser, they use a regular t-test without multiple testing correction (as far as I can conclude from the very concise documentation). You can set the significance level and not a FDR. This means that if you set the significance level to the lowest value (0,01) you allow 1% false positives. The rat array represents around 10000 genes so you'll do 10000 t-tests. With an alpha level of 0,01 this will generate 100 false positives.
Although the t-test in this tool is far from ideal, the two other tests also have their limitations, mainly the fact that they rely on arbitrary thresholds (3-fold? 2-fold?).
What about the choice between a two- or a one-tailed t-test? Although the one-tailed test is tempting since it is less stringent than the two-tailed test (the threshold for calling a t-statistic significantly different from 0 is lower in a one-sided test), you are only allowed to use a one-tailed test if you know upfront (before you do the experiment) that you are only interested in one-sided differences. For instance, if you make a knock-out, you may assume that the gene will be downregulated in the mutant or not changing expression (if the mutant is not ok) but you do not expect to find overexpression of the gene you knocked out. In this case you could use a one-tailed test to check if the gene is properly knocked out.
Two tails comparison
Select the two-tailed test, with the lowest significance level (0,01). The groups are defined using the mouse.
The comparison is based on the selected test method. The choice of two tails allows to find DE genes, either up- or downregulated.
The result is a very long list of DE genes with barplot view on the right confirming the difference in expression between the two groups. However, in this format the list is not really useful
Luckily, the right menus allow us to do more usefull things with the selected genes
1. Profile data
This lets you download differentially expressed data selected by the tool
2. Profile pathways
Search pathways enriched in the list of DE genes
Column legends
Rank mean difference "4x & either"
The two-tailed t-test found too many DE genes to be specific for given pathways. This is because of the experimental setup, comparing two completely different tissues (heart and diaphragm). In most microarray experiments, the groups that you compare are not that different and you will find less DE genes. The only way we can reduce this number is by doing a test that evaluates fold changes and selecting a high threshold, e.g. the rank mean difference test with a 4-fold difference threshold. Now, we get only the 152 genes with the most extreme differences in expression between heart and diaphragm (at least a 4-fold difference).
Search pathways enriched in the obtained subset
The pathways are now very specific to heart biology as expected
Cluster heatmaps
Two options are presented in this part, the hierarchical clustering and the K-Mean approaches.
Hierarchical clustering
Hierarchical clustering and heatmaps are classical representations of differential expression that nicely provide high level view on the data.
You may choose between Euclidean distance and Pearson correlation as measures for the distance between two genes (expression profiles). You must realize that the data are not standardized by this tool so both distance measures will measure different things:
- Euclidean distance will sum up absolute differences between expression levels: it groups genes that have similar expression VALUES
- Pearson correlation will measure the similarity of the expression profiles: it groups genes that have similar expression BEHAVIOUR, up in the same samples and down in the same samples
So the two distance measures will result in different trees, as you can see in the figure below showing the immediate neighbours of Creg1 (represented twice on the array) in the resulting tree.
You see that Euclidean distance groups genes with very similar expression values (to the same extent high or low), while Pearson correlation focusses on trends (high in the same samples regardless of how high exactly) and that they result in different groupings.
So it's important to make the right choice of distance measure for what you want to know !
You see that you also have to choose how to define the distance between clusters. This choice is not as important as the choice between Euclidean distance and Pearson correlation. The difference between the three proposed methods is:
- average: you measure the distances between cluster centers, the standard choice
- complete: you measure the distance between the farthest members of each cluster
- single: you measure the distance between the closest members of each cluster
A number of interactive links allow changing the correlation method, changing colors, and/or selecting heatmap regions to plot other graphs or export data to file
K-Means clustering
.
If you run K-means repeatedly, it will always return different clusters. So you will not get the same results as the ones below!!! BUT you will find back mostly the same genes in similar clusters
The user decides arbitrarily the number of clusters to be found. Too few will lead to mixed profiles while too many will lead to apparent redundancy.
From the obtained solution, it is clear that the four clusters are grouping genes having clear trends in both sample groups. Groups #1 and #3 being highly contrasted between Diaphragm and Heart while #2 and #4 are not/more subtle DE groups
Clicking on the first group to get more details for genes higher in the Diaphragm
Similarly, for genes higher in the Heart
Finally, zooming on a small number of genes allows plotting their profiles and exporting the results to file
Experiment design and value distribution
This QC tool plots box-plot for each sample and allows evaluating the global quality of the dataset. A good dataset has a constant median line and similar distributions.
download exercise files
Download exercise files here.
References:
- ↑ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser/
- ↑ http://www.ncbi.nlm.nih.gov/geo/info/datasets.html
[ Main_Page | Hands-on Analysis of public microarray datasets | PubMA_Exercise.2 | PubMA_Exercise.3 |
| PubMA_Exercise.4 ]