Are you a data scientist who loves working with R, mainly because it is free and open source? But do you run into a wall when the datasets you are working with get really large? Well, we have a solution for you. In this post, our faculty member Kiran P.V, who is a Big Data expert, tells you about some popular R packages that are great for working with distributed Big Data platforms.

Open source R is a popular statistical tool, supported by a vast number of packages. Currently, there are about 5000 external R packages, and the number is constantly growing thanks to an active community. With the help of these packages, we can perform almost any task within data science. Hearing this, one might wonder: if R's advantages are so widespread, why does it not dominate the statistical tools market across the world? It is certainly moving in that direction, but it still has a lot of ground to cover before becoming the number one choice among statistical tools.

Well, one of the major drawbacks of open source R is its inability to handle big datasets efficiently. Generally, data sets containing anywhere from one million to a few hundred million records can be managed in standard R. However, the problem surfaces when you need to handle data sets that contain one billion records or more, and these days most companies working with Big Data do deal with this much data.

The problem lies in the fact that R processes data in memory, so what it can handle generally depends on the RAM of the system it is running on. For any given data set, R's processing effectiveness increases as the system's RAM increases; for example, an 8 GB RAM system performs better than a 4 GB RAM system.
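To put rough numbers on the RAM limit, here is a back-of-the-envelope sketch in base R. It assumes a table made up only of numeric (double) columns at 8 bytes per value and ignores R's own overhead and any working copies; the helper name `estimate_gb` and the choice of 10 columns are purely illustrative.

```r
# Rough RAM needed to hold a purely numeric table in memory.
# Assumption: every value is a double (8 bytes); object overhead is ignored.
estimate_gb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^3
}

small <- estimate_gb(1e6, 10)  # one million rows, 10 columns
large <- estimate_gb(1e9, 10)  # one billion rows, 10 columns

print(round(small, 2))  # about 0.07 GB
print(round(large, 1))  # about 74.5 GB

# R can report the real footprint of an object that already exists:
print(object.size(numeric(1e6)), units = "MB")
```

Under these assumptions, a million-row table fits comfortably on a laptop, while a billion-row table needs far more RAM than an 8 GB machine can offer, even before any analysis creates extra copies of the data.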

However, there are alternatives that allow R to handle big data sets more efficiently: performing in-database analytics, leveraging multi-core parallel processing frameworks, using distributed computing frameworks based on MapReduce, or relying on Revolution R Enterprise, a commercial version of open source R.
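As a small illustration of the multi-core option, the sketch below uses the `parallel` package that ships with R. The two-worker cluster, the four chunks, and the toy vector are assumptions chosen for the example, not a recommendation for real workloads.

```r
library(parallel)

# Toy data: kept as doubles so the total does not overflow integer range.
x <- as.numeric(1:1000000)

# Split the vector into 4 roughly equal chunks.
n_chunks <- 4
chunks <- split(x, cut(seq_along(x), n_chunks, labels = FALSE))

# Start 2 local worker processes (assumed to be available on this machine).
cl <- makeCluster(2)

# Each worker sums the chunks it receives; the results come back as a list.
partial_sums <- parLapply(cl, chunks, sum)

stopCluster(cl)

# Combine the partial results on the master process.
total <- Reduce(`+`, partial_sums)
print(total == sum(x))
```

The data still has to fit in the memory of one machine here; parallel workers speed up the computation but do not remove the RAM limit, which is where distributed frameworks come in.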

Of all these, I will focus in this article mainly on distributed computing frameworks such as Hadoop (which is MapReduce based), MongoDB, and other NoSQL platforms, which are currently popular technologies for handling big data effectively at the enterprise level.

Fortunately, thanks to strong community support, there are specific R packages that help us integrate open source R with the Big Data frameworks mentioned above, such as Hadoop and MongoDB. Let’s begin by ascertaining what we need to know in order to use these packages effectively:

  1. Working knowledge of big data platforms like Hadoop, MongoDB, etc.
  2. Implementation-level understanding of the MapReduce framework
  3. Hands-on exposure to R programming
  4. Fundamental aspects of handling RDBMS databases

If we have the above knowledge and skills, we can use the following Big Data R packages to integrate R with Big Data frameworks:

  • RHIPE
  • ravro
  • plyrmr
  • rmr
  • rhdfs
  • rhbase
  • rmongodb
  • RHive
  • RImpala

Of the packages listed above, RHIPE is one of the most popular solutions available. It integrates R with a Hadoop cluster, especially HDFS and MapReduce, to carry out large, parallel computations.

Another set of R packages, namely ravro, plyrmr, rmr, rhdfs and rhbase, falls under one group: the RHadoop library framework. RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop. These packages are regularly tested on recent releases of the Cloudera and Hortonworks Hadoop distributions and should be broadly compatible with open source Hadoop and MapR’s distribution.

Let’s take a quick look at what each of the R packages under the RHadoop library framework is used for:

  • ravro is used to read and write files in Avro format
  • plyrmr is used to perform higher-level data manipulation tasks on structured data stored in a Hadoop cluster
  • rmr is used to write MapReduce functions in R syntax and is commonly used as an alternative to Java-based MapReduce programming
  • rhdfs helps with file management in HDFS from within R
  • rhbase helps with database management of HBase from within R
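To make the MapReduce idea behind rmr more concrete, here is a minimal word-count sketch in plain base R. It does not use the rmr API and does not touch Hadoop; it only mimics the map, shuffle and reduce phases on a tiny in-memory vector. The input lines and the helper name `map_fn` are made up for illustration.

```r
# Toy input: each element stands in for one line of a file stored in HDFS.
lines <- c("big data with r", "r with hadoop", "big data")

# Map phase: emit one (key = word, value = 1) pair per word in a line.
map_fn <- function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(lines, map_fn))

# Shuffle phase: group all values that share the same key.
grouped <- split(mapped$value, mapped$key)

# Reduce phase: collapse each group of values into a single count.
counts <- vapply(grouped, sum, integer(1))

print(counts)
```

On a real cluster, the map and reduce steps run on many machines at once and the framework takes care of the shuffle. With rmr you would supply functions similar in spirit to the map and reduce steps above, written in R rather than Java.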

The rmongodb package provides an interface from open source R to MongoDB and back using the mongodb-C library.

The last two packages, RHive and RImpala, help connect R with the structured, SQL-style components of the Hadoop ecosystem, namely Hive and Impala, so that Hadoop data can be processed effectively using Hive-style queries. The best part of these packages is that the required queries can be run on Hadoop data from within the R console itself. Surely, as new big data technologies come into business, more and more such R packages will be developed by the active R community, helping to make R the number one choice of statistical tool in the field of big data.
