Introduction

There are several models that data can be fit to for a thorough analysis. Before you do so, you have to determine which model is the best fit for the data at hand. To that end, you explore the data, its shape, and its characteristics, and come up with a summary that describes its current state and whether it needs further processing before it can be modeled with statistical and scientific techniques. This exploration of data, usually with the help of descriptive statistics, visualization tools, and presentation techniques, makes up what we call Exploratory Data Analysis, or EDA, in data science.

EDA in data science is quite like the service advisor doing a rough inspection of your car, asking a few preliminary questions, setting expectations and then taking the car in for service. It is one of the first things done with the data, so it is a critical phase, as many inferences and consequent actions depend on this exploration.

In this article let us look at:

  1. Definition
  2. What is EDA
  3. Examples of EDA in Data Science
  4. Techniques of EDA in Data Science
  5. Tools

1. Definition

Exploratory Data Analysis (EDA) is the initial analysis of data supplied or extracted, carried out with descriptive statistics and visualization tools, to understand the trends, underlying limitations, quality, patterns, and relationships between the various entities within the data set. EDA will give you a fair idea of which model better fits the data and whether any data cleansing and massaging might be required before the data is taken through advanced modelling techniques or put through Machine Learning and Artificial Intelligence algorithms.

2. What is EDA

EDA can be quite extensive and time-consuming, depending on what data you have and how much of it. Unfortunately, there is no single structured way to perform EDA, although a few techniques tend to give good results. Of the many outcomes of EDA, the important ones that you should try to get from the data are:

  • Detect outliers and anomalies
  • Determine the quality of data
  • Determine what statistical models can fit the data
  • Find out whether the assumptions about the data that you or your team started out with are correct or way off
  • Extract variables or dimensions on which the data can be pivoted
  • Determine whether to apply univariate or multivariate analytical techniques

3. Examples of EDA in Data Science

OK, time for some examples that might give you an idea of what EDA really entails, what you are looking for, and what questions you are trying to answer.

Example 1: Missing data

With data come a lot of anomalies. One of them is missing data. Although the overall data might be good, there may be columns within the data set that are missing values. This can skew your results and keep you from building an accurate model for further use.

One great way to identify missing values visually is a missing-data visualization package in Python such as missingno. This is most useful for a large data set, where it gives you a graphic story of how much data is missing and on which variables.

Without going into the details of the coding, the above graphic was achieved with a single line of code. The white lines in the above graph indicate missing values.
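As a rough numeric complement to the visual check, the share of missing values per column can also be computed directly. The sketch below uses only the Python standard library on a small made-up data set; the column names, the values, and the helper missing_share are assumptions for illustration and do not come from the original example.

```python
# Hypothetical toy data set; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Pune"},
    {"age": None, "income": 61000, "city": "Delhi"},
    {"age": 29, "income": None, "city": None},
    {"age": 41, "income": None, "city": "Mumbai"},
]


def missing_share(rows):
    """Return the fraction of missing (None) values for each column."""
    columns = rows[0].keys()
    return {
        col: sum(1 for row in rows if row[col] is None) / len(rows)
        for col in columns
    }


# Print one line per column, e.g. how much of "income" is missing.
for col, share in missing_share(rows).items():
    print(f"{col}: {share:.0%} missing")
```

If the data is already in a pandas DataFrame named df, df.isna().mean() gives the same per-column share.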

Example 2: Summary statistics

Another example is summary statistics, which give you a fair idea of your numeric data.

As shown above, the columns are features of the data set, and the statistics in the left column describe each.
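A table of this kind can be sketched with the standard library's statistics module. The feature names and values below are invented for illustration, and summarize is a hypothetical helper; in practice, the describe() method of a pandas DataFrame produces a similar table.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Hypothetical data set: each key is a feature (column).
data = {
    "age": [23, 35, 31, 44, 52],
    "income": [28000, 52000, 47000, 61000, 90000],
}

# One row of statistics per feature.
for feature, values in data.items():
    print(feature, summarize(values))
```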

Example 3: Outliers

Outliers are data points that lie at the extreme of, or even outside, the range of values that a variable should normally hold, thereby giving you a hint or an opportunity to explore.

You can immediately notice an outlier: a good percentage of customers are buying more than 50 products. An investigation can be initiated and, in many cases, these customers turn out to be resellers. This can be seen as an opportunity to develop a B2B relationship with the resellers and grow it as a separate vertical in the business.
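One common way to flag such values numerically is the 1.5 × IQR rule. This rule is an assumption chosen here for illustration (the article does not say which method was used), and the purchase counts below are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical number of products bought per customer.
products_per_customer = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 55, 72]

# Customers far above the typical basket size are candidates for a closer look.
print(iqr_outliers(products_per_customer))
```

The multiplier k controls how strict the rule is; a larger k flags fewer points.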

4. Techniques of EDA in Data Science

There are broadly two categories of EDA, graphical and non-graphical. Each of these is further divided into univariate and multivariate EDA, depending on whether you examine one variable at a time or the relationships among several variables in your data.

Univariate non-graphical: Here, the data features a single variable, and the EDA is done in mostly tabular form, for example, summary statistics. These non-graphical analyses give you a statistic that indicates how skewed your data might be or which is the dominant value for your variable if any.

Univariate graphical: The EDA here involves graphical tools like bar charts and histograms to get a quick view of how variable properties are stacked against each other, whether there is a relationship between these properties, and whether there is any interdependency among them.

Multivariate non-graphical: Non-graphical methods like crosstabs are used to depict the relationship between two or more variables. Statistical values like the correlation coefficient indicate whether there is a possible relationship and how strong it is.

Multivariate graphical: A graphical representation often gives you a better understanding of the relationship, especially among multiple variables.
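The two multivariate non-graphical methods named above, crosstabs and the correlation coefficient, can be sketched in a few lines. The data is hypothetical, crosstab and pearson are illustrative helpers, and the Pearson coefficient is assumed here as the correlation measure.

```python
import math
from collections import Counter


def crosstab(xs, ys):
    """Count how often each (x, y) pair of categories occurs."""
    return Counter(zip(xs, ys))


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical categorical variables for five customers.
segment = ["retail", "retail", "reseller", "retail", "reseller"]
region = ["north", "south", "north", "north", "north"]
print(crosstab(segment, region))

# Hypothetical numeric variables; a value close to 1 suggests a strong
# positive linear relationship.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 74]
print(round(pearson(hours, score), 3))
```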

5. Tools

The most commonly used software tools to perform EDA are Python and R.

Both enjoy massive community support and frequent updates to the packages that can be used for EDA. Let’s look at the various graphical instruments that can be used to execute an EDA.

  • Box plots

Box plots are used where there is a need to summarize data on an interval scale, like the data from the stock market, where ticks observed in one whole day may be represented in a single box, highlighting the lowest value, the highest value, the median, the quartiles, and any outliers.

  • Heatmap

Heatmaps are most often used for the representation of the correlation between variables. Here is an example of a heatmap.

As you can see from the chart, there is a strong correlation between density and residual sugar and absolutely no correlation between alcohol and residual sugar.

  • Histograms

The histogram is the graphical representation of numerical data that splits the data into ranges. The taller the bar, the greater the number of data points falling in that range. A good example here is the height data of a class of students. You would notice that the height data for a particular class looks like a bell curve, with most of the data lying within a certain range and a few points outside it. There will be outliers too, either very short or very tall.
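The binning behind a histogram can be sketched without any plotting library. The heights below are invented, and the 10 cm bin width and the helper histogram are assumptions for illustration.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each bin of width bin_width."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical heights (in cm) for one class of students.
heights_cm = [138, 149, 151, 152, 154, 155, 156, 157,
              158, 158, 159, 161, 162, 163, 166, 178]

# Print a text histogram: one row per bin, one '#' per student.
for start, count in histogram(heights_cm, 10).items():
    print(f"{start}-{start + 9}: {'#' * count}")
```

Most of the values land in the middle bins, with only a few at either end, which is the bell-like shape described above.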

Conclusion

Exploratory Data Analysis is essential in the analysis of massive data sets, because it helps ensure that you have the right data for the chosen statistical model. You certainly would not want to figure out at a later stage that the data is not a good fit for the statistical model you are trying to build. A sound EDA must be performed before any data mining, data analysis, or data modeling occurs.

