In the business analytics field, the ability to work with multiple statistical tools is an important asset for building a successful career. Currently, the market is dominated by tools such as SAS, SPSS and R. Thanks to its diverse functionality and open-source nature, R is steadily becoming an essential part of any business analyst's or data scientist's toolkit. Personally, I have been using R for three years, and it hasn't been a bad decision at all. Though the initial learning curve for R is steep, I slowly gained confidence and started writing lots of R programs comfortably. What interests me most about R is having access to most data analysis techniques through additional packages, along with the flexibility to create nice data visualizations. Based on my programming experience with R, I am going to talk about the top 5 packages that I have found most helpful on real-world projects. Remember that this list might vary from person to person.
Communicating the final results to business end users with good visualizations is often a critical aspect of a project. For this purpose, the ggplot2 package comes in quite handy and has been widely adopted by R programmers. It is an implementation of the grammar of graphics in R and a great way to present your work. With this package, one can plot aesthetically appealing, customized visualizations ranging from scatter plots to bubble graphs. Learning the various features of ggplot2 is easy, and with practice you can build complex visualizations soon enough.
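As a minimal sketch of the grammar-of-graphics style, here is a ggplot2 scatter plot built from the `mtcars` dataset that ships with R (the column choices and labels are just illustrative):

```r
library(ggplot2)

# Scatter plot of car weight vs. fuel efficiency, coloured by cylinder count.
# Each "+" adds a layer or setting, which is the grammar-of-graphics idea.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(title = "Fuel efficiency vs. weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders") +
  theme_minimal()

print(p)
```

Because plots are ordinary R objects, you can keep adding layers (smoothers, facets, themes) to `p` later, which is what makes ggplot2 so composable.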
Prior to R, I was mostly working on SAS and SQL-based RDBMS systems. After hearing about the sqldf package, I found it very similar to my earlier way of programming. It is simple: one can perform SQL queries directly on R data frames. Anyone with the basic SQL skills used for database querying can pick it up easily; sqldf uses SQLite syntax. If your work involves connecting to external databases, R also has connector drivers for most of them, such as RODBC, RSQLite, RMySQL, RMongo and RPostgreSQL. So the next time you plan to work with an external database like MySQL from R, remember to use the RMySQL driver, which lets you run SQL scripts from the R console.
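To give a flavour of querying a data frame as if it were a database table, here is a small sketch using sqldf on the built-in `mtcars` data (the query itself is just an example):

```r
library(sqldf)

# Treat the mtcars data frame as a SQL table: average mpg per cylinder count.
# sqldf uses SQLite syntax under the hood.
result <- sqldf("SELECT cyl, AVG(mpg) AS avg_mpg
                 FROM mtcars
                 GROUP BY cyl
                 ORDER BY cyl")
print(result)
```

The same `GROUP BY` habit you learned on a database carries over unchanged, which is exactly why SQL veterans find this package comfortable.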
Performing data manipulation tasks in R can be quite taxing if one is not using the plyr package. Initially, I relied on built-in functions and loops for these tasks, but over time I learned about plyr and started using it. It is one of the best data manipulation packages available, with functions for splitting a dataset into pieces, applying a function to each piece, and combining and examining the results. If you currently depend on the "apply" family of functions for data manipulation in R, I seriously recommend trying plyr; it won't disappoint you.
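The split-apply-combine pattern described above can be sketched with plyr's `ddply` on the built-in `mtcars` data (the grouping column and summary statistics are just illustrative choices):

```r
library(plyr)

# Split mtcars by number of cylinders, summarise each piece,
# and combine the pieces back into a single data frame.
summary_df <- ddply(mtcars, .(cyl), summarise,
                    mean_mpg = mean(mpg),
                    mean_hp  = mean(hp),
                    n        = length(mpg))
print(summary_df)
```

Compare this one call with the equivalent `split()` plus `lapply()` plus `do.call(rbind, ...)` dance in base R, and the appeal is obvious.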
Model building is one of the critical components of any predictive analytics project. You can find many packages in R for building linear models, such as linear regression and logistic regression, and non-linear models, such as random forests, neural networks and support vector machines. Typically, linear models are more interpretable and widely used in industry, whereas non-linear models lack interpretability and are often referred to as black-box models. Of these, one of my favourites is the randomForest package, which helps you build a non-linear model relating the variables in your dataset. It is easy to use and works on many different types of data. Another good thing about this package is that it can double as a feature selection tool: say your dataset has more than 200 variables and you need to find the most significant ones, randomForest provides variable importance measures that rank the variables so you can keep only the important ones. If you want to start working on non-linear models, this package is a good place to begin.
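A minimal sketch of both uses, fitting a forest and reading off variable importance, again on the built-in `mtcars` data (formula and seed are illustrative):

```r
library(randomForest)

set.seed(42)  # random forests are stochastic; fix the seed for reproducibility

# Fit a random forest predicting mpg from all other variables
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

# Variable importance: larger %IncMSE means the variable matters more,
# which is the basis for using the forest as a feature selection tool
print(importance(rf))
varImpPlot(rf)
```

In a 200-variable dataset you would fit the forest once, rank the variables by importance, and refit your downstream model on the top handful.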
Apart from the quality of the data, feature selection and the modeling technique used play an important role in building better predictive models. In reality, model building is a very iterative process involving many trials, depending on the number of observations and variables in the dataset. So it is no wonder that one starts wishing for an automated way to run the most promising modeling trials and return the best model. The answer is the caret package, which handles data preparation, feature selection, building multiple predictive models with various techniques, running validation checks and printing model performance diagnostics. This is a lot, and getting used to all these functions takes some time, but once you do, it makes the model-building experience far more enjoyable. Because of this, caret has become quite popular among R programmers in recent years, especially in the predictive analytics field.
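To illustrate the unified interface, here is a sketch that trains two very different models through the same `train()` call and compares them with cross-validation (the models, fold count and seed are illustrative choices, and the random forest path assumes the randomForest package is also installed):

```r
library(caret)

set.seed(42)

# Reuse one resampling scheme for every model: 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Same interface for a linear model and a random forest;
# only the "method" string changes
lm_fit <- train(mpg ~ ., data = mtcars, method = "lm", trControl = ctrl)
rf_fit <- train(mpg ~ ., data = mtcars, method = "rf", trControl = ctrl)

# Compare cross-validated performance (RMSE, R-squared) side by side
summary(resamples(list(lm = lm_fit, rf = rf_fit)))
```

Swapping in a gradient boosting machine or an SVM is just another `method` string, which is exactly the "many trials, one workflow" convenience the package is known for.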