In this concluding part of the 3-part series, leaders of ALN discuss the skills a data scientist needs and how to hire one.
What is data science? How do I become a data scientist? How do I transition into data science from software engineering? What are AI and ML? Will I become a data scientist if I learn all the algorithms?
These are some of the questions aspiring data scientists regularly ask me. Is data science all about learning algorithms? I beg to differ. Data science, or data analytics, is all about solving business problems. AI, machine learning and deep learning are ammunition for solving the business problem. So the core is “solving the business problem”, not algorithms.
I have spent more than a decade in the industry. When I started my career, industry problems revolved around structured data, and then slowly moved to a combination of structured and unstructured data. Web and social media analytics have become increasingly important in the past 3-4 years. Now, image and video analytics is gaining momentum.
Earlier, “model development” was considered a very niche skill set, and it helped us solve various complex industry problems. With growing data complexity, it became imperative to start using cutting-edge technologies like AI and ML rather than relying only on statistical models. Also, using open-source tools like R and Python has become the industry norm, compared with the earlier reliance mostly on SAS.
But all of this only helps you solve the business problem. If you use the most sophisticated machine learning technique but cannot build a business case and communicate it effectively to your stakeholders, it is of no practical business use.
My advice to budding data scientists is to not get distracted by the fast pace of changing technology. What you have learnt today may become obsolete tomorrow.
The most important skill you should acquire is “problem solving”.
Me: “What have you done in your capstone project?”
Student: “I started with logistic regression. Then I tried decision trees. I tried to improve the model by using random forest. I also tried other ML techniques like SVM and neural networks. I am now thinking of trying deep learning.”
Me: “These are just algorithms. I am just asking you what problem you solved in your project.”
Student: “But I used so many different techniques. I can now add all of these to my CV.”
Me: “All I want to know is how did it help the business?”
Student: “It helped me add so many new skills to my resume. I think I will go apply for the role of a machine learning specialist now.”
As someone who has been teaching data science for over 10 years now, I am still amazed at the fascination aspiring data scientists have with fancy modeling techniques. Everyone wants to learn as many different techniques as possible. Or rather, to be able to add them to their CVs and claim to know them all.
In my experience, being well versed in multiple algorithms or modeling techniques is a good-to-have skill, but it is not the be-all and end-all of learning data science.
If you have built a logistic regression model with 80% accuracy, in most cases a machine learning technique like neural networks may get it up to 82 or 83%. It is certainly not going to suddenly lift it to 99%. That kind of lift can only happen by adding more information to your model – by getting more data.
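The point can be illustrated with a small simulation. This is a minimal sketch on made-up synthetic data, not a result from any real project: the outcome depends on two signals, `x1` and `x2`, plus noise, and all names and parameters here are assumptions chosen for illustration. By symmetry, the rule "predict positive when `x1 > 0`" is the best any algorithm can do when it only sees `x1`, so the gap between the two accuracies comes entirely from the extra information, not from a smarter technique.

```python
import random


def simulate(n=20000, seed=0):
    """Return (accuracy using x1 only, accuracy using x1 and x2).

    Hypothetical setup: the true outcome is positive when
    x1 + x2 + noise > 0, with all three drawn from normal distributions.
    """
    rng = random.Random(seed)
    hits_one = 0   # correct predictions when only x1 is available
    hits_both = 0  # correct predictions when x1 and x2 are available
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = rng.gauss(0, 1)
        noise = rng.gauss(0, 0.5)
        y = (x1 + x2 + noise) > 0

        # Best possible rule with x1 alone (by symmetry of x2 and noise).
        hits_one += (x1 > 0) == y
        # Best possible rule once the second signal is added.
        hits_both += (x1 + x2 > 0) == y
    return hits_one / n, hits_both / n


if __name__ == "__main__":
    acc_one, acc_both = simulate()
    print(f"x1 only:   {acc_one:.3f}")
    print(f"x1 and x2: {acc_both:.3f}")
```

In this toy setup the second accuracy should come out clearly higher than the first, and no amount of algorithm swapping on `x1` alone can close that gap.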
Yet the young data scientists I see are obsessed with trying fancy techniques on their data rather than identifying more information that could make the outcome itself more predictable.
My humble plea to all aspiring data scientists – stop chasing fancy algorithms. Just because ‘deep learning’ is the latest buzzword does not mean every data science problem needs a deep learning solution. Don’t obsess over getting every algorithm onto your CV. Instead, when you get a chance to work on real-life problems, like a capstone project, focus your time and energy on things that add value – like pulling in information that will make your model stronger. This is what you will need to do in the real world as well.
Always remember this quote from Peter Norvig, Google’s research director – “We don’t have better algorithms. We just have more data.”
There is a vicious cycle in the data science industry, and it starts with the catch-all job description (JD) put out by most corporations. This ‘stupid’ JD often includes requirements that start with probability and statistics, go on to SQL and database management, segue into about 15 machine learning algorithms (trees, random forest, bagging, SVM, kNN, among others), sidetrack into buzzwords like deep learning, artificial intelligence and cognitive computing, ask for expertise in 3-5 tools (R, Python, SAS etc.), and then want you to send a video of yourself doing a Michael Jackson moonwalk while doing a handstand in the middle of a highway! Well, maybe we got a bit carried away in the last part there, but you do get the drift.
Now, since hundreds of corporations are throwing such JDs at those planning a career in data science, candidates are confused and try to cram algorithm after algorithm without working on their basic data prep, statistics or business skills. And as educators, we have a large number of potential students walking in to the center asking us why ‘arcane algo X’ or ‘bizarre algo Y’ is not taught. The answer is simple – because no one really uses it and you cannot master everything in one course. Get your basics right, learn a few algorithms well, get to know your data and how to prepare it for analysis – and the rest will follow. But not all are convinced!
Continuing with that vicious cycle, most institutes try to cram in far more algorithms and fashionable terms than can possibly be dealt with within a year. That becomes the only way to attract students. And the students are significantly worse off: they have not mastered the basic assumptions of linear regression, yet they mouth off terms from artificial intelligence for which they have barely had a chance to read the Wikipedia article. And when these students go to the job interview for the role with the stupid JD, the data science managers crush them and are shocked at their foolhardiness.
We propose that we all get real. Let companies get their JDs closely checked not by HR teams but by the business units. And let the business heads reach out to their teams to tell them that a stupid JD will only get you stupid candidates. Let us not ask for more than the few select algorithms and tools that we really use – because a good candidate can always skill sideways. We have not seen a good R coder who was completely stumped by Python, or vice versa. Old-timers like us, who started off when many of the latest algorithms and tools did not even exist as a thought, would not have jobs now, let alone leadership roles, if we could not keep learning. And we are sure the youngsters in the field will keep learning too.
If you are an analytics leader and wish to share your experience or thoughts about the Indian analytics space, I invite you to write to me.