Introduction

In today’s data-heavy systems, where everything is captured for future use, every load brings in a mammoth data set. That data set can be large in terms of observations, in terms of the number of features or columns, or both. Data mining becomes tedious in such cases, because only a few important features usually contribute to the value you can extract from the data, and complex queries can take a long time to run over such huge data sets. In such cases, a quick alternative is data reduction.

  1. What is Data Reduction
  2. Data Reduction Techniques
  3. Dimensionality Reduction
  4. Numerosity Reduction
  5. Parametric
  6. Non-Parametric
  7. Histogram
  8. Clustering
  9. Sampling
  10. Data Cube Aggregation
  11. Data Compression
  12. Key Takeaways

1. What is Data Reduction

According to Wikipedia, a formal definition of Data Reduction goes like this: “Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form.”

2. Data Reduction Techniques

There are 2 primary methods of Data Reduction: Dimensionality Reduction and Numerosity Reduction.

3. Dimensionality Reduction

Dimensionality Reduction reduces the number of dimensions the data is spread across, that is, the attributes or features the data set carries. As the number of dimensions grows, the data becomes sparser, which hurts clustering, outlier analysis and other algorithms. With reduced dimensionality, it is also easier to visualize the data. There are at least 3 types of Dimensionality Reduction.

  • Wavelet Transform

The Wavelet Transform is a lossy method for dimensionality reduction in which a data vector X is transformed into another vector X’ of wavelet coefficients; both X and X’ have the same length. Unlike the original, the wavelet-transformed vector can be truncated by keeping only the strongest coefficients, and that truncation is what achieves the dimensionality reduction. Wavelet transforms are well suited for data cubes, sparse data or highly skewed data, and are often used in image compression.
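As a rough illustration, here is a minimal NumPy sketch of a one-level Haar wavelet transform followed by truncation; the input vector and the threshold are made up for the example.

    import numpy as np

    def haar_step(x):
        # One level of the Haar wavelet transform: pairwise averages
        # (approximation) and pairwise differences (detail).
        avg = (x[0::2] + x[1::2]) / np.sqrt(2)
        diff = (x[0::2] - x[1::2]) / np.sqrt(2)
        return avg, diff

    # Hypothetical data vector X (even length, for this sketch).
    X = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

    approx, detail = haar_step(X)

    # Keep only detail coefficients above a made-up threshold; the
    # truncated result is shorter than X yet still captures its shape.
    detail[np.abs(detail) < 1.0] = 0.0
    X_reduced = np.concatenate([approx, detail[detail != 0.0]])
    print(X.size, "->", X_reduced.size)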

  • Principal Component Analysis

This method identifies a small number of orthogonal vectors (the principal components) in the n-dimensional attribute space that best represent the data, and projects the original tuples onto them. PCA can be applied to skewed and sparse data.
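A minimal scikit-learn sketch of PCA-based reduction on a made-up data set; the number of components and the data itself are assumptions for illustration.

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical data set: 100 observations with 10 correlated attributes.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(100, 2))
    X = base @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(100, 10))

    # Project the 10 attributes onto the 2 principal components that
    # explain most of the variance; the reduced matrix stands in for X.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X.shape, "->", X_reduced.shape)        # (100, 10) -> (100, 2)
    print(pca.explained_variance_ratio_.sum())   # close to 1.0 here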

  • Attribute Subset Selection.

Here, attributes that are irrelevant to the mining task or redundant with other attributes are left out of a core attribute subset. Selecting this core attribute subset reduces both the data volume and the dimensionality.
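As a sketch, the snippet below keeps the 2 attributes most related to a class label using scikit-learn's SelectKBest; the small data frame, the column names and the choice of k are made up for the example.

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_classif

    # Hypothetical labelled data with a mix of useful and noisy attributes.
    df = pd.DataFrame({
        "age":      [25, 32, 47, 51, 62, 23, 43, 36],
        "income":   [40, 60, 80, 82, 90, 35, 70, 55],
        "zip_hash": [11, 93, 27, 64, 88, 45, 12, 71],  # likely irrelevant
        "bought":   [0, 0, 1, 1, 1, 0, 1, 0],          # class label
    })
    X, y = df.drop(columns="bought"), df["bought"]

    # Keep the 2 attributes most correlated with the class (ANOVA F-test);
    # the rest are excluded from the core attribute subset.
    selector = SelectKBest(f_classif, k=2).fit(X, y)
    core_subset = X.columns[selector.get_support()]
    print(list(core_subset))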

4. Numerosity Reduction

This method uses alternative, smaller forms of data representation, thus reducing data volume. There are 2 types of Numerosity Reduction: Parametric and Non-Parametric.

5. Parametric

This method assumes a model that the data fits. The model's parameters are estimated, only those parameters are stored, and the actual data is discarded. For example, a regression model can be used to achieve Parametric reduction if the data fits it.

Linear Regression models a linear relationship between 2 attributes of the data set. Let’s say we need to fit a linear regression model between 2 attributes, x and y, where y is the dependent attribute and x is the independent or predictor attribute. The model can be represented by the equation y = wx + b, where w and b are the regression coefficients. A multiple linear regression model lets us express the attribute y in terms of multiple predictor attributes.
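A small NumPy sketch of the idea: fit y = wx + b on made-up data and keep only w and b instead of the raw points.

    import numpy as np

    # Hypothetical data: y roughly follows y = 3x + 2 with some noise.
    x = np.arange(100, dtype=float)
    y = 3.0 * x + 2.0 + np.random.default_rng(1).normal(scale=2.0, size=100)

    # Fit y = wx + b and keep only the two parameters instead of 100 pairs.
    w, b = np.polyfit(x, y, deg=1)
    print(f"store only w={w:.2f}, b={b:.2f}; the raw points can be discarded")

    # The stored model approximates any original value on demand.
    y_hat = w * x + b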

Another method, the Log-Linear model, discovers the relationship between 2 or more discrete attributes. Assume we have a set of tuples in n-dimensional space; the log-linear model helps to estimate the probability of each tuple in this space from a smaller set of lower-dimensional combinations of the attributes.
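The sketch below shows the simplest possible log-linear model, the independence model, on a made-up two-attribute contingency table; only the marginals need to be stored, not the full table.

    import numpy as np

    # Hypothetical contingency table over two discrete attributes:
    # rows = device type (mobile, desktop), columns = plan (free, paid, trial).
    counts = np.array([[30.0, 10.0, 5.0],
                       [20.0, 25.0, 10.0]])
    total = counts.sum()

    # Independence log-linear model: log p(i, j) = u + u_row(i) + u_col(j),
    # i.e. each cell probability is estimated from the marginals alone.
    row_marginal = counts.sum(axis=1) / total
    col_marginal = counts.sum(axis=0) / total
    p_est = np.outer(row_marginal, col_marginal)

    print(np.round(p_est * total, 1))   # the model's estimate of the cell counts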

6. Non-Parametric

A Non-Parametric numerosity reduction technique does not assume any model. Non-Parametric techniques give a more uniform reduction irrespective of the data, but they may not achieve as high a degree of reduction as Parametric ones. There are at least 4 types of Non-Parametric data reduction techniques: Histogram, Clustering, Sampling, and Data Cube Aggregation.

7. Histogram

A histogram approximates an attribute's distribution by partitioning its values into buckets and storing only the bucket boundaries and counts rather than the raw values. Histograms can represent dense, sparse, skewed or uniform data, and multidimensional histograms involving multiple attributes remain effective for up to about 5 attributes together.
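A quick NumPy sketch on made-up values: store only the bucket boundaries and counts instead of the raw observations.

    import numpy as np

    # Hypothetical attribute with 10,000 skewed values.
    values = np.random.default_rng(2).exponential(scale=10.0, size=10_000)

    # Replace the raw values with 20 equal-width buckets: only the bucket
    # edges and counts are stored, not the 10,000 observations.
    counts, edges = np.histogram(values, bins=20)
    print(counts.size + edges.size, "numbers instead of", values.size)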

8. Clustering

In Clustering, the data set is replaced by a cluster representation: the data is split into clusters based on similarity to other items within the same cluster and dissimilarity to items in other clusters. The more similar the items, the closer they sit within the cluster. The quality of a cluster depends on its diameter, the maximum distance between any 2 data items in the cluster.
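A minimal scikit-learn sketch on made-up points: replace the data set with its cluster centroids; the number of clusters is an assumption for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical data set of 1,000 two-dimensional points.
    X = np.random.default_rng(3).normal(size=(1000, 2))

    # Replace the 1,000 points with 10 cluster centroids; each original
    # point is represented by the centroid of the cluster it falls into.
    kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
    representatives = kmeans.cluster_centers_
    print(X.shape, "->", representatives.shape)   # (1000, 2) -> (10, 2)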

9. Sampling

Sampling reduces a large data set to a much smaller sample that still represents the original data. There are 4 types of sampling-based data reduction methods, sketched in the snippet after the list below.

  • Simple Random Sample Without Replacement (SRSWOR) of size s

  • Simple Random Sample With Replacement (SRSWR) of size s

  • Cluster Sample

  • Stratified Sample
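Here is a small pandas sketch of all four methods on a made-up customer table; the column names, sample sizes and fractions are assumptions for illustration.

    import pandas as pd

    # Hypothetical data set of 10,000 customer records.
    df = pd.DataFrame({
        "customer_id": range(10_000),
        "region": ["north", "south", "east", "west"] * 2_500,
    })

    srswor = df.sample(n=500, replace=False, random_state=0)  # without replacement
    srswr  = df.sample(n=500, replace=True,  random_state=0)  # with replacement

    # Stratified sample: draw proportionally from each region.
    stratified = df.groupby("region", group_keys=False).apply(
        lambda g: g.sample(frac=0.05, random_state=0)
    )

    # Cluster sample: pick whole regions at random and keep all their rows.
    chosen = pd.Series(df["region"].unique()).sample(n=2, random_state=0)
    cluster_sample = df[df["region"].isin(chosen)]

    print(len(srswor), len(srswr), len(stratified), len(cluster_sample))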

10. Data Cube Aggregation

Data Cube Aggregation is multidimensional aggregation that uses aggregates at various levels of a data cube to represent the original data set, thus achieving data reduction. The data cube is also a much more efficient way of storing data, and it makes aggregation operations faster.
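A short pandas sketch of the idea on made-up sales records: the aggregated cube replaces the individual transactions.

    import pandas as pd

    # Hypothetical daily sales records.
    sales = pd.DataFrame({
        "year":    [2021, 2021, 2021, 2022, 2022, 2022],
        "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
        "region":  ["north", "south", "north", "south", "north", "south"],
        "amount":  [120, 80, 150, 90, 200, 110],
    })

    # Aggregate up the cube: quarterly totals per region replace the
    # individual transactions, a much smaller representation.
    cube = sales.pivot_table(index=["year", "quarter"], columns="region",
                             values="amount", aggfunc="sum")
    print(cube)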

11. Data Compression

It employs modification, encoding or restructuring of the data so that it consumes less space. Data compression builds a compact representation of information by removing redundancy and representing the data in binary form. Compression from which the original data can be fully restored is called Lossless compression, while compression from which the original cannot be fully restored is Lossy compression.
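A minimal sketch of lossless compression using Python's built-in zlib module on a made-up, highly redundant text.

    import zlib

    # Hypothetical text with plenty of redundancy.
    original = ("timestamp,sensor,value\n"
                + "2023-01-01,temp,21.5\n" * 1000).encode()

    compressed = zlib.compress(original)    # lossless compression
    restored = zlib.decompress(compressed)  # original fully recovered

    print(len(original), "->", len(compressed), "bytes;",
          "restored intact:", restored == original)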

12. Key Takeaways

  • Data reduction reduces data volume, making the data easier to represent and to run through advanced analytical algorithms.
  • Data reduction also helps deduplicate data, reducing the load on storage and on the algorithms serving data science techniques downstream.

Conclusion 

Data Reduction can be achieved in two principal ways: one by reducing the number of data records or features, and the other by generating summary data and statistics at different levels.

If you are interested in making a career in the Data Science domain, our 11-month in-person Postgraduate Certificate Diploma in Data Science course can help you immensely in becoming a successful Data Science professional. 
