
The Need for Big Data?

Statistical software such as base R reads data into memory (RAM) by default, so it is easy to exhaust RAM by storing unnecessary information. On a 32-bit system, the OS and architecture can address only about 4 GB of memory. Loading more data than memory can hold slows the system down.
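To make the memory limits above concrete, here is a small sketch in base R. It measures how much RAM a single numeric vector occupies and works out the 32-bit address limit. The vector length of one million is an arbitrary choice for illustration, not a figure from this post.

```r
# A numeric (double) vector stores 8 bytes per element, all of it held in RAM.
x <- numeric(1e6)
size_bytes <- as.numeric(object.size(x))
size_mb <- size_bytes / 2^20

# A 32-bit pointer can address 2^32 bytes, which is where the 4 GB ceiling comes from.
addressable_gb <- 2^32 / 2^30

cat(sprintf("1e6 doubles occupy about %.1f MB\n", size_mb))
cat(sprintf("32-bit address space: %g GB\n", addressable_gb))
```

Scaling the same arithmetic up shows why a few hundred million values are already enough to fill a 4 GB address space.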

Hence we need robust systems to handle big data. Such systems should be able to cope with the Volume, Variety and Velocity of the data.

The Volume of data is growing continuously. Huge feeds of data are generated every day, e.g. by social networking sites such as Twitter and Facebook, so data storage and data processing are the need of the hour. Conventional databases such as Teradata and SQL Server can struggle with very large volumes (terabytes of data). Velocity signifies the speed at which data must be handled. Variety is another typical problem facing data analysts today: loading and analysing unstructured forms of data such as text and graphics.

So different kinds of Big Data systems are being designed to overcome these limitations of traditional systems.

Why R?

Because it is open source and has a wide variety of built-in statistical commands, R is one of the tools most widely used by statisticians and data analysts. New packages are being developed, and old packages constantly updated, to support Big Data in R. This is an advantage for R users, who don’t have to migrate to other platforms to handle huge volumes of data.

Some Basic Terminology

  • A CPU is the main processing unit in a system. Sometimes CPU refers to the processor, sometimes it refers to a core on a processor. Today, most systems have one processor with between one and four cores. Some have two processors, each with one or more cores. 

  • A cluster is a set of machines that are all interconnected and share resources as if they were one giant computer. 

  • The master is the system that controls the cluster, and a slave or worker is a machine that performs computations and responds to the master’s requests.

Handling Big Data in R
R offers the following approaches to handling Big Data:
1. Explicit Parallelism: Parallelism means running several computations simultaneously by splitting your data processing across multiple cores. Explicit parallelism means telling the system explicitly each step to follow. Several R packages support this:
A. Rmpi: MPI stands for Message Passing Interface.
MPI defines an environment where programs run in parallel and communicate by passing messages to each other. MPI is popular because its model is simple: programs just use messages to communicate. But Rmpi requires the user to deal with several issues explicitly, like:

  • Sending data and functions etc. to slaves.
  • Querying the master for more tasks
  • Telling the master the slave is done

Rmpi is flexible and versatile for those who are familiar with MPI and clusters in general. For everyone else, it may be a bit difficult to handle. But not to worry: R has other, friendlier packages for handling big data, as you will see further on.
B. Snow & Snowfall API:
Snow (Simple Network of Workstations) provides an interface to several parallelization packages such as MPI and PVM (Parallel Virtual Machine). All of these systems allow intrasystem communication for working with multiple CPUs, or intersystem communication for working with a cluster.
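As a rough sketch of the snow style of explicit parallelism, the example below uses the parallel package that ships with R and provides snow-style cluster functions. The two local workers and the scale_factor variable are assumptions made for illustration; a real cluster would name remote machines instead.

```r
library(parallel)

# Start two worker R processes on the local machine (assumed size, adjust as needed).
cl <- makeCluster(2)

# Workers start with an empty workspace, so data must be sent to them explicitly.
scale_factor <- 3
clusterExport(cl, "scale_factor", envir = environment())

# Split the input across the workers and gather the results as a list.
scaled <- unlist(parLapply(cl, 1:6, function(i) i * scale_factor))

# Always release the workers when done.
stopCluster(cl)

print(scaled)
```

Note how the export step is spelled out by hand. That is the "explicit" part: nothing reaches a worker unless you send it.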
Snowfall: The snowfall API uses list functions for parallelization. Calculations are distributed onto workers, where each worker gets a portion of the full data to work with. The snowfall API is very similar to the snow API and has the following features:

  • Functions for loading packages and sources in the cluster
  • Functions for exchanging variables between cluster nodes.
  • All wrapper functions contain extended error handling.
  • Changing cluster settings does not require changing R code; it can be done from the command line.
  • All functions work in sequential mode as well. Switching between modes requires no change in R code.

C. sfCluster
Generally in a cluster, all nodes have comparable performance specifications and each node takes approximately the same time to run. sfCluster makes sure that your cluster can meet your needs without exhausting resources.
The advantages of using sfCluster:

  • It checks your cluster to find machines with free resources. The available machines (or even the available parts of a machine) are built into a new sub-cluster which belongs to the new program.
  • It can optionally monitor the cluster for usage and stop programs if they exceed their memory allotment, disk allotment, or if machines start to swap.
  • It also provides diagnostic information about the current running clusters and free resources on the cluster including PIDs, memory use and runtime.

2. Implicit Parallelism: Unlike explicit parallelism, where the user controls (and can mess up) most of the cluster settings, implicit parallelism avoids most of the messy legwork of setting up the system and distributing data.
The following R packages employ implicit parallelism and can be used for handling big data:
A. Multicore: Provides functions for parallel execution of R code on systems with multiple cores or multiple CPUs. Unlike other parallelism solutions, multicore jobs all share the same state when spawned, so no data or code needs to be initialized. Spawning is also very fast, since no new R instance needs to be started. The doMC library makes multicore processing available to foreach.
Its major drawback is that it cannot be used on Windows, because Windows lacks the fork system call. It is used extensively on UNIX systems.
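Here is a minimal sketch of the fork-based approach, assuming the parallel package bundled with current R, which includes multicore's mclapply. The offset variable and the choice of two cores are illustrative only. On Windows the sketch falls back to one core, since forking is unavailable there.

```r
library(parallel)

# Defined before the fork: child processes see it without any export step.
offset <- 10

# Forking is not available on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each element is processed in a forked copy of the current R session.
shifted <- unlist(mclapply(1:4, function(i) i + offset, mc.cores = n_cores))

print(shifted)
```

Compare this with a snow-style cluster: there is no cluster to create, nothing to export, and nothing to shut down.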
B. Parallel and Collect: parallel (or mcparallel) creates an external process that evaluates an expression. After you hit enter, R does not wait for the call to finish. collect retrieves the results of the spawned processes.
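The spawn-then-collect pattern can be sketched as follows. This assumes a UNIX-like system and the parallel package in current R, where the functions are named mcparallel and mccollect. The two toy expressions are placeholders for real long-running work.

```r
library(parallel)

# Each call forks a child process and returns immediately with a job handle.
job_a <- mcparallel(sum(1:100))
job_b <- mcparallel(mean(c(2, 4, 6)))

# The main session is free to do other work here while the children run.

# Block until both jobs finish; results come back in the order of the job list.
results <- mccollect(list(job_a, job_b))
values <- unname(unlist(results))

print(values)
```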
C. Foreach: Allows the user to iterate through the elements of a collection in parallel without an explicit loop counter. By default, iteration is sequential unless a parallel backend is registered, such as doMC, which is an interface between foreach and multicore.
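A short sketch of the foreach style, assuming the foreach package is installed (the code skips itself if it is not). It uses the sequential %do% operator so that it runs without any backend. With doMC registered, swapping %do% for %dopar% is the usual way to run the same loop in parallel.

```r
if (requireNamespace("foreach", quietly = TRUE)) {
  library(foreach)

  # No loop counter to manage: foreach hands each element to the body
  # and .combine = c glues the per-iteration results into one vector.
  squares <- foreach(i = 1:5, .combine = c) %do% {
    i^2
  }

  print(squares)
} else {
  message("foreach is not installed; skipping the example")
}
```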
Let’s now discuss the rest of the Big Data Family in R. Following are some of the latest packages designed to handle big data in R:
A. bigmemory
This package consists of:

  • big.matrix is an R object that simply points to a data structure in C++. It is local to a single R process and is limited by available RAM.

  • shared.big.matrix is similar, but can be shared among multiple R processes (similar to parallelism on data)

  • filebacked.big.matrix does not point to a data structure; instead it points to a file on disk containing the matrix, and the file can be shared across a cluster

The major advantages of using this package are:

  • Can store a matrix in memory, restart R, and gain access to the matrix without reloading data. Great for big data.
  • Can share the matrix among multiple R instances or sessions.
  • Access is fast because RAM is fast. C++ also helps.

One disadvantage is that a matrix can contain only one type of data.
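The pointer idea can be sketched as below, assuming the bigmemory package is installed (the code skips itself otherwise). The matrix dimensions and the value 42 are arbitrary. In real use the descriptor would be handed to a second R process rather than re-attached in the same session.

```r
if (requireNamespace("bigmemory", quietly = TRUE)) {
  library(bigmemory)

  # The data lives in C++-managed memory; x is only a pointer to it.
  x <- big.matrix(nrow = 4, ncol = 2, type = "double", init = 0)

  # A descriptor is a small object that tells another R process where the data is.
  desc <- describe(x)

  # Attaching yields a second pointer to the same matrix, not a copy.
  y <- attach.big.matrix(desc)
  y[1, 1] <- 42

  # The write made through y is visible through x.
  shared_value <- x[1, 1]
  print(shared_value)
} else {
  message("bigmemory is not installed; skipping the example")
}
```

The type argument applies to the whole matrix, which is the single-type restriction mentioned above.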
B. Fast File Access (ff)
In bigmemory, R keeps a pointer to a C++ matrix, and the matrix is stored in RAM or on disk. In ff, R keeps metadata about the object, and the object is stored in a flat binary file.

ff allows R to work with multiple large datasets. It also helps keep the system clean rather than cluttered with tons of files. For modelling purposes, ffbase has bigglm.ffdf for building generalized linear models easily on large data, and it can connect to the stream package for clustering and classification.

So if you have to choose between bigmemory and ff, you can pick either, as both offer similar performance.

C. Map/Reduce 

MapReduce is a way of dividing a large job into many smaller jobs, each producing an output, and then combining the individual outputs into one output. It is a classic divide-and-conquer approach that is parallel and can easily be distributed among many cores of a CPU, or among many CPUs and nodes.

mapReduce is a pure R implementation of MapReduce. Many authors state that mapReduce is simply:
apply(map(data), reduce)
By default, mapReduce uses sapply for the apply step, which is sequential; the apply argument shown below is where a parallel alternative would be supplied.
For example: mapReduce(map, …, data, apply = sapply)
In simple terms:

  • Map: Performs operations such as filtering and sorting of the data
  • Reduce: Performs Summary operations on the mapped data
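The map and reduce steps can be illustrated with a word count written in base R only. This is a sketch of the idea, not the mapReduce package's API: the docs vector and the helper names map_fn and reduce_fn are made up for the example.

```r
# Toy input: each element stands for one chunk of a much larger dataset.
docs <- c("big data in r", "r handles big data", "data data data")

# Map: turn one chunk into (word, 1) pairs, stored as a named vector.
map_fn <- function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  setNames(rep(1, length(words)), words)
}

# Reduce: merge two sets of pairs by summing the counts for each word.
reduce_fn <- function(a, b) {
  combined <- c(a, b)
  vapply(split(combined, names(combined)), sum, numeric(1))
}

# The map step is independent per chunk, so lapply could be replaced by a
# parallel apply without touching map_fn or reduce_fn.
mapped <- lapply(docs, map_fn)
word_counts <- Reduce(reduce_fn, mapped, numeric(0))

print(word_counts)
```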

D. Hadoop: Hadoop is an open-source implementation of MapReduce that has gained a lot of traction in the data mining community. Hadoop is distributed with Hadoop Streaming, which allows map/reduce jobs to be written in any language, including R.
The HadoopStreaming package is used in R to run map/reduce.
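Hadoop Streaming talks to a script through standard input and output, one record per line, with the key and value separated by a tab. The sketch below shows what a word-count mapper might look like in plain R under that contract. The helper name emit_word_pairs is hypothetical and does not come from the HadoopStreaming package.

```r
# Turn raw input lines into "word<TAB>1" records, the key/value format
# that Hadoop Streaming expects on standard output.
emit_word_pairs <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z0-9]+"))
  words <- words[nzchar(words)]
  if (length(words) == 0) {
    return(character(0))
  }
  paste(words, 1, sep = "\t")
}

# Inside a real mapper script run by Hadoop, the last step would be:
#   writeLines(emit_word_pairs(readLines(file("stdin"))))

# Local dry run on two sample lines:
sample_output <- emit_word_pairs(c("Big Data", "big R"))
writeLines(sample_output)
```

Hadoop then groups the records by key and feeds them to a reducer script, which would sum the 1s for each word.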

Rhipe is another R interface for Hadoop. It provides the following advantages:
  • Incorporates an rhlapply function similar to the standard apply variants.
  • Uses Google Protocol Buffers.
  • Great flexibility in modifying Hadoop parameters.

At the end of the day, which package in R will you choose to handle Big Data?

  • For datasets of around 10 GB, bigmemory and ff cope well.
  • For datasets in the range of terabytes and petabytes, you can interface R with Hadoop.
