PubMA Exercise.2b
Follow-up analysis demo in RStudio
[ Main_Page | PubMA_Exercise.1 | PubMA_Exercise.2 |
| PubMA_Exercise.3 ]
Contents
Introduction
The former exercise produced a full table with differential expression between Heart and Diaphragm samples that can be used directly with Functional enrichment tools or IPA. The R-script used by GEO2R is quite standard and basic and only provides BoxPlots for signal distribution. This is a minimum when doing MA data analysis and users often appreciate to have some more QC done on the data to control for biases or inconsistencies associated with artifacts or with variability in the experiment.
Expert analyst makes use of the [R]-Bioconductor toolbox to further analyze their data. Some examples of standard R functions are shown below to show you the power of this programing language. The following exercise is provided as an appetizer and is far not exhaustive, you will find many more methods and tools by exploring the CRAN documentation (http://cran.r-project.org[1])
installing on a non-BITS laptop
required for non-BITS laptops
# If running a unix OS computer please install 'libxml2-devel and libcurl-devel' required to build dependencies # install minimal bioconductor source("http://bioconductor.org/biocLite.R") biocLite() # Add required packages for today's session (will add RCurl, XML, ... dependencies) ## biocLite(c("Biobase", "GEOquery", "limma", "affy")) # some other packages are not required here but are required for RobiNA in other exercises # these required packages were identified by parsing all R scripts present in the windows build of Robina ## biocLite (c("affy", "affyPLM", "RankProd", "limma", "gcrma", "statmod", "marray", "plier", "makecdfenv", "Biostrings", "edgeR", "DESeq", "EDASeq"))
Extend the GEO2R analysis in R with RStudio
RStudio is the current best graphical environment to program i the [R] language and offers many facilities to learn and develop your R skills. The program is available free of charge from http://www.rstudio.com[2].
adapt the GEO2R script in RStudio
Original GEO2R code
The starting script is first reproduced here
initial GEO2R code
# Version info: R 2.14.1, Biobase 2.15.3, GEOquery 2.23.2, limma 3.10.1 # R scripts generated Tue Aug 12 05:30:54 EDT 2014 ################################################################ # Differential expression analysis with limma library(Biobase) library(GEOquery) library(limma) # load series and platform data from GEO ## gset <- getGEO("GSE6943", GSEMatrix =TRUE) gset <- getGEO("GSE6943", GSEMatrix =TRUE) if (length(gset) > 1) idx <- grep("GPL341", attr(gset, "names")) else idx <- 1 gset <- gset[[idx]] # make proper column names to match toptable fvarLabels(gset) <- make.names(fvarLabels(gset)) # group names for all samples sml <- c("G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1"); # log2 transform ex <- exprs(gset) qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T)) LogC <- (qx[5] > 100) || (qx[6]-qx[1] > 50 && qx[2] > 0) || (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2) if (LogC) { ex[which(ex <= 0)] <- NaN exprs(gset) <- log2(ex) } # set up the data and proceed with analysis fl <- as.factor(sml) gset$description <- fl design <- model.matrix(~ description + 0, gset) colnames(design) <- levels(fl) fit <- lmFit(gset, design) cont.matrix <- makeContrasts(G1-G0, levels=design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2, 0.01) tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250) # load NCBI platform annotation gpl <- annotation(gset) ## platf <- getGEO(gpl, AnnotGPL=TRUE) platf <- getGEO(gpl, AnnotGPL=TRUE) ncbifd <- data.frame(attr(dataTable(platf), "table")) # replace original platform annotation tT <- tT[setdiff(colnames(tT), setdiff(fvarLabels(gset), "ID"))] tT <- merge(tT, ncbifd, by="ID") tT <- tT[order(tT$P.Value), ] # restore correct order tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title")) write.table(tT, file=stdout(), row.names=F, sep="\t") ################################################################ # Boxplot for selected GEO samples library(Biobase) library(GEOquery) # load series and platform data from GEO ## gset <- getGEO("GSE6943", GSEMatrix=TRUE) gset <- getGEO("GSE6943", GSEMatrix=TRUE) if (length(gset) > 1) idx <- grep("GPL341", attr(gset, "names")) else idx <- 1 gset <- gset[[idx]] # group names for all samples in a series sml <- c("G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1") # order samples by group ex <- exprs(gset)[ , order(sml)] sml <- sml[order(sml)] fl <- as.factor(sml) labels <- c("Diaphragm","Heart") # set parameters and draw the plot palette(c("#dfeaf4","#f4dfdf", "#AABBCC")) dev.new(width=4+dim(gset)[[2]]/5, height=6) par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1)) title <- paste ("GSE6943", '/', annotation(gset), " selected samples", sep ='') boxplot(ex, boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=fl) legend("topleft", labels, fill=palette(), bty="n")
Improved code
The next script was adapted to keep local files and to save results to file instead of sanding them to a new browser window. Edits are shown where original lines are commented out by '##'
- 'destdir = base' saves the downloads to the local folder instead of to the temp directory
- 'number=nrow(fit2)' generates the full table instead of the top250
We now show the edited script where lines starting with ## are modified in the next line(s)
## added configuration base <- "~/PUBMA2014/ex2b-files", sep="") setwd(base) # Version info: R 2.14.1, Biobase 2.15.3, GEOquery 2.23.2, limma 3.10.1 # R scripts generated Tue Aug 12 05:30:54 EDT 2014 ################################################################ # Differential expression analysis with limma library(Biobase) library(GEOquery) library(limma) # load series and platform data from GEO ## gset <- getGEO("GSE6943", GSEMatrix =TRUE) gset <- getGEO("GSE6943", GSEMatrix =TRUE, destdir = base) if (length(gset) > 1) idx <- grep("GPL341", attr(gset, "names")) else idx <- 1 gset <- gset[[idx]] # make proper column names to match toptable fvarLabels(gset) <- make.names(fvarLabels(gset)) # group names for all samples sml <- c("G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1"); # log2 transform ex <- exprs(gset) qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T)) LogC <- (qx[5] > 100) || (qx[6]-qx[1] > 50 && qx[2] > 0) || (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2) if (LogC) { ex[which(ex <= 0)] <- NaN exprs(gset) <- log2(ex) } # set up the data and proceed with analysis fl <- as.factor(sml) gset$description <- fl design <- model.matrix(~ description + 0, gset) colnames(design) <- levels(fl) fit <- lmFit(gset, design) cont.matrix <- makeContrasts(G1-G0, levels=design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2, 0.01) ## tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250) tT <- topTable(fit2, adjust="fdr", sort.by="B", number=nrow(fit2)) # load NCBI platform annotation gpl <- annotation(gset) ## platf <- getGEO(gpl, AnnotGPL=TRUE) platf <- getGEO(gpl, AnnotGPL=TRUE, destdir = base) ncbifd <- data.frame(attr(dataTable(platf), "table")) # replace original platform annotation tT <- tT[setdiff(colnames(tT), setdiff(fvarLabels(gset), "ID"))] tT <- merge(tT, ncbifd, by="ID") tT <- tT[order(tT$P.Value), ] # restore correct order ## tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title")) tT.final <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title")) ## write.table(tT.final, file=stdout(), row.names=F, sep="\t") write.table(tT.final, file="GSE6943_DE.txt", row.names=F, sep="\t", quote=FALSE) ################################################################ # Boxplot for selected GEO samples library(Biobase) library(GEOquery) # load series and platform data from GEO ## gset <- getGEO("GSE6943", GSEMatrix=TRUE) gset <- getGEO("GSE6943", GSEMatrix=TRUE, destdir = base) if (length(gset) > 1) idx <- grep("GPL341", attr(gset, "names")) else idx <- 1 gset <- gset[[idx]] # group names for all samples in a series sml <- c("G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1") # order samples by group ex <- exprs(gset)[ , order(sml)] sml <- sml[order(sml)] fl <- as.factor(sml) labels <- c("Diaphragm","Heart") # set parameters and draw the plot ## save to file filename <- paste(base, "GSE6943-boxplot.pdf", sep="/") pdf(file = filename, bg = "white") palette(c("#dfeaf4", "#f4dfdf", "#AABBCC")) ## dev.new(width=4+dim(gset)[[2]]/5, height=6) par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1)) title <- paste ("GSE6943", '/', annotation(gset), " sample signal distribution", sep ='') boxplot(ex, boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=fl) legend("topleft", labels, fill=palette(), bty="n") dev.off()
## save R workspace for reuse outfile <- paste(base, "Workspace.RData", sep="/") save.image(file = outfile) sessionInfo()
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GEOquery_2.31.1 Biobase_2.25.0 BiocGenerics_0.11.4 limma_3.21.12
loaded via a namespace (and not attached):
[1] RCurl_1.95-4.3 tools_3.1.1 XML_3.98-1.1
download exercise files
Download exercise files here.
References:
[ Main_Page | PubMA_Exercise.1 | PubMA_Exercise.2 |
| PubMA_Exercise.3 ]