Working in new and emerging technologies, I am aware that a solution put together today can quickly become out of date as newer and better technologies and techniques keep arriving. I felt the full import of this when I was thinking through the details of how to implement in Spark a solution that we had built in the middle of last year (2017) on the Hadoop platform for a large private bank.
One of the largest private banks in India is a customer of my client. The bank wanted a system that estimates the propensity of its borrowers to default on the monthly loan installment (EMI), specifically those who have defaulted in the last 2 months and the last 3 months. We provided a Hadoop-based solution. I was responsible for the data engineering part, which included ETL, pre-processing and post-processing of data, and also for the end-to-end development and deployment of the solution, including setting up an internal Hadoop cluster, with a team of 1 Hadoop admin and 1 Hive developer. Two data scientists/analysts, assisted by an SQL developer, developed the models used for scoring/prediction. We were given access to 4 tables in an instance of the customer’s RDBMS, which contained the loan details, demographic data of the borrowers, payment history and the elaborate tracking of follow-up actions that you typically see in financial institutions. The models were arrived at from these after considerable exploration, analysis, testing and iteration.
As for the technology platform, though the solution design was in place more or less from the outset, the timeline got extended due to a couple of change requests and requests for POCs, for example,
The picture below gives an overview of the design and flow of the application that was deployed and working at the site.
All the above steps are put into a Linux shell script, which is scheduled using cron on the Hadoop cluster’s namenode to run monthly. I think this can be termed a classic Hadoop use case.
Migrating the application to Spark, as we know, will make it faster ("lightning fast", as the Spark web site puts it) and bring all the other good things that Spark offers, most importantly a uniform platform. The customer, naturally, is more interested in using the deployed system than in migrating to newer ways of doing the same thing.
However, thinking through the details of implementing this application in Spark we see that:
So, as a matter of fact, we wouldn’t even need Hadoop and HDFS! All we need is a cluster of commodity servers with, say, 32 GB or more of RAM and a TB or two of hard disk each. I generally brush aside articles with titles like ‘Hadoop is out!’ or, worse still, ‘Is Hadoop dead?’, considering them alarmist or attempts to get attention (or, to use a trendy phrase, grab eyeballs), but they are not totally off the mark after all.
However, if we look at it at the enterprise level, a data extraction exercise like the one in this case is most likely going to serve multiple applications, not just a single one.
This is where Hadoop can serve as a veritable data lake – collecting and storing the data from all sources and channels in whatever form it is given. Each application like the one above can dip into this data lake, take the data in its available form, cleanse it and bottle it as per the processing, reporting and analytics requirements.
The more data there is, the better and more accurate the analytic models are. And all analytics require, if not demand, a robust pipeline for the data pre-processing steps, from cleansing to transformation to data reduction and so on.
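To make the idea of such a pipeline a little more concrete, here is a minimal sketch in plain Python of the three stages mentioned above: cleansing, transformation and data reduction. It is only an illustration and not the code we deployed. The field names (`loan_id`, `emi_amount`, `paid_flags`, `branch`), the sample records and the rules inside each step are all assumptions made up for the example.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Step = Callable[[List[Record]], List[Record]]


def cleanse(records: List[Record]) -> List[Record]:
    """Drop records with no loan id or with an EMI amount that is not numeric."""
    cleaned = []
    for rec in records:
        if not rec.get("loan_id"):
            continue
        try:
            emi = float(rec["emi_amount"])
        except (KeyError, TypeError, ValueError):
            continue
        cleaned.append({**rec, "emi_amount": emi})
    return cleaned


def transform(records: List[Record]) -> List[Record]:
    """Derive a feature: how many installments in the history were missed."""
    return [
        {**rec, "missed_count": sum(1 for paid in rec.get("paid_flags", []) if not paid)}
        for rec in records
    ]


def reduce_columns(records: List[Record]) -> List[Record]:
    """Keep only the (hypothetical) columns a model would need."""
    keep = ("loan_id", "emi_amount", "missed_count")
    return [{key: rec[key] for key in keep} for rec in records]


def run_pipeline(records: List[Record], steps: List[Step]) -> List[Record]:
    """Apply each pre-processing step in order."""
    for step in steps:
        records = step(records)
    return records


# Made-up sample data: one usable record and two that cleansing should drop.
raw = [
    {"loan_id": "L001", "emi_amount": "12500", "paid_flags": [True, False, False], "branch": "X"},
    {"loan_id": "", "emi_amount": "9000", "paid_flags": [True, True, True]},
    {"loan_id": "L003", "emi_amount": "n/a", "paid_flags": [False]},
]

prepared = run_pipeline(raw, [cleanse, transform, reduce_columns])
print(prepared)
```

Keeping each stage as a separate function means that different applications dipping into the same raw data can reuse the cleansing step and swap in their own transformations.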
So Hadoop certainly has its prime place among Big Data technologies, though it may no longer be synonymous with Big Data as it was just a few years ago.
This article was originally published on my LinkedIn Pulse. You can read the original publication here.