Tuesday, November 27, 2012
Fast Clustering in R and byte comilation.
Ran into a problem with having to cluster a large data set for creating heatmaps in R. Normally hclust() is not bad for small datasets (less then 1000 rows) but rapidly gets time consuming as the data size grows. A quick dig into the complexity reveals hclust() to be Θ(N³).
Found an R package (“fastcluster”) which does significantly better for most clustering methods(single, complete, avg, weighted) being Θ(N²). Now that’s a pretty decent improvement.
Still, with a dataset dim(45.000 , 200). I’m still looking at a few hours….:(
After some digging found that R released a byte compilation library in version 2.13.0. It essentially lets to compile you function into binary before execution which can speed things up to five fold.
http://dirk.eddelbuettel.com/blog/2011/04/12/
The implementation is pretty straight forward.
myFunction <- function();
library(compiler)
cmyFunction <- cmpfun(myFunction);
Fastcluster
Significant faster clustering algorithm for R and python.
http://math.stanford.edu/~muellner/fastcluster.html#2
Quick test on RStudio.swmed.edu:
http://math.stanford.edu/~muellner/fastcluster.html#2
Quick test on RStudio.swmed.edu:
system.time(pheatmap(test,color=hmcols,cluster_rows=TRUE,cluster_cols=FALSE,legend=FALSE,show_rownames=FALSE,show_colnames=FALSE,clustering_method="complete",clustering_distance_rows="correlation"))
238.219 seconds
library("fastcluster")
system.time(pheatmap(test,color=hmcols,cluster_rows=TRUE,cluster_cols=FALSE,legend=FALSE,show_rownames=FALSE,show_colnames=FALSE,clustering_method="complete",clustering_distance_rows="correlation"))
28.176 seconds
HeatMaps with R (Chip-Seq)
In order to create a heatmap of tag counts from chip-seq data using R first generate the data matrix using Homer's annotatePeak.pl
annotatePeaks.pl TSS_hg19_8kb.bed hg19 -hist 100 -ghist -d TagDirectory > output
where 'tss_hg19_8kb.bed' is the location plus minus 8kb surround the tss of all refseq genes downloaded from UCSC table browser.
Next import data in R.
Set color:
hmcols<-colorRampPalette(c("white","red"))(256)
Create heatmap:
annotatePeaks.pl TSS_hg19_8kb.bed hg19 -hist 100 -ghist -d TagDirectory > output
where 'tss_hg19_8kb.bed' is the location plus minus 8kb surround the tss of all refseq genes downloaded from UCSC table browser.
Next import data in R.
Set color:
hmcols<-colorRampPalette(c("white","red"))(256)
Create heatmap:
pheatmap(test,color=hmcols,cluster_rows=TRUE,cluster_cols=FALSE,legend=FALSE,show_rownames=FALSE,show_colnames=FALSE)
note: the original heatmap() function in R does a scaling on the values resulting in scaled representation of values.
note: labRow="" and labCol="" gets rid of the x-axis and y-axis labels.
note: labRow="" and labCol="" gets rid of the x-axis and y-axis labels.
note: scale="row" scales data by row.
Subscribe to:
Posts (Atom)