Let's Be Faster and More Parallel in R with the doParallel Package

Recently I have been exposed to large data structures: objects such as data frames that are as big as 100 MB in size (if you don't know, you can find the size of an object with the object.size(one_object) command). When you come to R from another background, you are mostly used to for or foreach loops; however, I have come to appreciate the expressiveness of lapply loops. In this blog post, I show you some options to boost the performance of loops in R.

Case: Calculating Prime Numbers

Let's imagine we are computing the prime numbers up to n, for each n from 10 to 100000, using the following function:

getPrimeNumbers <- function(n) {  
   n <- as.integer(n)
   if(n > 1e6) stop("n too large")
   # sieve of Eratosthenes: start with every number marked prime, except 1
   primes <- rep(TRUE, n)
   primes[1] <- FALSE
   last.prime <- 2L
   for(i in last.prime:floor(sqrt(n)))
   {
      # cross off every multiple of the current prime
      primes[seq.int(2L*last.prime, n, last.prime)] <- FALSE
      # advance to the next number still marked prime
      last.prime <- last.prime + min(which(primes[(last.prime+1):n]))
   }
   which(primes)
}

Note that the function is taken from: http://stackoverflow.com/questions/3789968/generate-a-list-of-primes-in-r-up-to-a-certain-number

Now let's compare how each loop type performs:

for vs. lapply

Let's look at lapply:

index <- 10:100000  
result <- lapply(index, getPrimeNumbers)  

This is how you can do it using a for loop:

result <- list()  
index <- 10:100000  
for (i in seq_along(index)) {  
  result[[i]] <- getPrimeNumbers(index[i])
} 

You might also agree that lapply makes the code more beautiful. It is, however, slightly slower here than the native for loop: the for loop finished in 55.4708 seconds on average over 10 runs, while lapply took 57.00911 seconds.
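The post doesn't show how these timings were collected; a simple harness built on system.time (the helper name timeIt and its shape are assumptions for illustration, not from the original) could look like this:

```r
# hypothetical helper: average elapsed (wall-clock) time of f() over several runs
timeIt <- function(f, runs = 10) {
  times <- vapply(seq_len(runs), function(r) system.time(f())[["elapsed"]], numeric(1))
  mean(times)
}

# usage sketch: time the serial lapply version on a small workload
avg.lapply <- timeIt(function() lapply(1:1000, sqrt), runs = 3)
```

Passing a function (rather than a bare expression) matters: an expression would be lazily evaluated once and cached, so only the first run would be timed.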

But can it be better? I used to think not. I complained a lot that R is slow (and, to be honest, it is), but there is room for improvement, so let's see.

doParallel::parLapply

Now let's go multi-core:

library(doParallel)  
no_cores <- detectCores() - 1  
cl <- makeCluster(no_cores, type="FORK")  
result <- parLapply(cl, 10:100000, getPrimeNumbers)  
stopCluster(cl)  

The same loop took only 19.38573 seconds on average over 10 runs. Remember that detectCores() returns how many cores your CPU has; just to be safe from any RStudio crashing, I used one core less than that. Also, make sure to call stopCluster so you free up resources. One caveat: FORK clusters only work on Unix-like systems, not on Windows.
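For Windows users, a sketch of a portable variant using a PSOCK cluster (getPrimeNumbers is repeated from above so the snippet runs on its own, and the range is reduced here just to keep the demo quick):

```r
library(parallel)

# getPrimeNumbers() as defined earlier, repeated so this snippet is self-contained
getPrimeNumbers <- function(n) {
  n <- as.integer(n)
  primes <- rep(TRUE, n)
  primes[1] <- FALSE
  last.prime <- 2L
  for (i in last.prime:floor(sqrt(n))) {
    primes[seq.int(2L * last.prime, n, last.prime)] <- FALSE
    last.prime <- last.prime + min(which(primes[(last.prime + 1):n]))
  }
  which(primes)
}

no_cores <- max(1, detectCores() - 1)
cl <- makeCluster(no_cores, type = "PSOCK")  # works on every platform, including Windows

# unlike FORK workers, PSOCK workers do not inherit the parent environment,
# so anything they need must be exported (or passed along) explicitly
clusterExport(cl, "getPrimeNumbers")

result <- parLapply(cl, 10:1000, getPrimeNumbers)  # the post benchmarks 10:100000
stopCluster(cl)
```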

doParallel::foreach

doParallel's foreach with the %dopar% operator is very fast: the loop took only 14.87837 seconds on average over 10 runs!
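The foreach code itself isn't shown above; a sketch of what it likely looked like (getPrimeNumbers is repeated so the snippet runs standalone, and the range is reduced here just for a quick demonstration):

```r
library(doParallel)  # loads foreach and parallel as well

# getPrimeNumbers() as defined earlier, repeated so this snippet is self-contained
getPrimeNumbers <- function(n) {
  n <- as.integer(n)
  primes <- rep(TRUE, n)
  primes[1] <- FALSE
  last.prime <- 2L
  for (i in last.prime:floor(sqrt(n))) {
    primes[seq.int(2L * last.prime, n, last.prime)] <- FALSE
    last.prime <- last.prime + min(which(primes[(last.prime + 1):n]))
  }
  which(primes)
}

no_cores <- max(1, detectCores() - 1)
registerDoParallel(cores = no_cores)  # forks on Unix; uses a cluster on Windows

# %dopar% evaluates each iteration on a worker; results are collected into a list
result <- foreach(i = 10:1000) %dopar% getPrimeNumbers(i)  # the post benchmarks 10:100000

stopImplicitCluster()
```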

doParallel::mclapply

The last function I am going to showcase is the easy-to-use but not-very-impressive mclapply:

library(doParallel)  
cores <- detectCores() - 1  
result <- mclapply(10:100000, getPrimeNumbers, mc.cores=cores)  

Although you don't need to create a cluster as with the other functions, it ran in around 42.62276 seconds on average: slightly better than the for loop while using more cores, but worse than doParallel::foreach or doParallel::parLapply. (mclapply actually comes from the parallel package, which doParallel loads; also note that on Windows mc.cores must be 1, so it falls back to running serially.)

Results

Now let's visualize the results using the amazing ggplot2, so that we can see them in a more easily digestible way:

library(ggplot2)

loopMethods <- c('for', 'lapply', 'doParallel::parLapply', 'doParallel::foreach', 'doParallel::mclapply')
runtime <- c(55.4708, 57.00911, 19.38573, 14.87837, 42.62276)

result <- data.frame(loopMethods, runtime)
colnames(result) <- c('loop type', 'runtime (sec)')

ggplot(result, aes(x = `loop type`, y = `runtime (sec)`)) + theme_bw() + geom_bar(stat = "identity")

Here's how it looks:

How each method performed

Let's Clear Out The Confusion

The reason for using the doParallel package is that fork-based parallelism (the mechanism behind mclapply and FORK clusters in the parallel package) does not work on Windows. The doParallel package makes parallel foreach loops work on all platforms: Unix, Linux, macOS, and Windows, so it's a very good wrapper.

The Rstats tag of this blog is syndicated on R-Bloggers.