Tuesday, November 27, 2012

Fast Clustering in R and byte comilation.

Ran into a problem with having to cluster a large data set for creating heatmaps in R. Normally hclust() is not bad for small datasets (less then 1000 rows) but rapidly gets time consuming as the data size grows. A quick dig into the complexity reveals hclust() to be Θ(N³). Found an R package (“fastcluster”) which does significantly better for most clustering methods(single, complete, avg, weighted) being Θ(N²). Now that’s a pretty decent improvement. Still, with a dataset dim(45.000 , 200). I’m still looking at a few hours….:( After some digging found that R released a byte compilation library in version 2.13.0. It essentially lets to compile you function into binary before execution which can speed things up to five fold. http://dirk.eddelbuettel.com/blog/2011/04/12/ The implementation is pretty straight forward. myFunction <- function(); library(compiler) cmyFunction <- cmpfun(myFunction);

Fastcluster

Significant faster clustering algorithm for R and python.
http://math.stanford.edu/~muellner/fastcluster.html#2

Quick test on RStudio.swmed.edu:

system.time(pheatmap(test,color=hmcols,cluster_rows=TRUE,cluster_cols=FALSE,legend=FALSE,show_rownames=FALSE,show_colnames=FALSE,clustering_method="complete",clustering_distance_rows="correlation"))
238.219 seconds
 
library("fastcluster")
system.time(pheatmap(test,color=hmcols,cluster_rows=TRUE,cluster_cols=FALSE,legend=FALSE,show_rownames=FALSE,show_colnames=FALSE,clustering_method="complete",clustering_distance_rows="correlation"))
28.176 seconds

HeatMaps with R (Chip-Seq)

In order to create a heatmap of tag counts from chip-seq data using R first generate the data matrix using Homer's annotatePeak.pl

annotatePeaks.pl TSS_hg19_8kb.bed hg19 -hist 100 -ghist -d TagDirectory > output

where 'tss_hg19_8kb.bed' is the location plus minus 8kb surround the tss of all refseq genes downloaded from UCSC table browser.

Next import data in R.

Set color: 

hmcols<-colorRampPalette(c("white","red"))(256)

Create heatmap:

 pheatmap(test,color=hmcols,cluster_rows=TRUE,cluster_cols=FALSE,legend=FALSE,show_rownames=FALSE,show_colnames=FALSE)
  
note: the original heatmap() function in R does a scaling on the values resulting in scaled representation of values.
note: labRow="" and labCol="" gets rid of the x-axis and y-axis labels.
note: scale="row"  scales data by row.