Data Science Cheat Sheet For Beginners

Introduction

If you are a Data Scientist, you already know the many SQL statements, Excel formulas, functions, and algorithms your profession relies on. You have probably mastered the ones you use often, but sometimes a project demands unfamiliar applications or new tools in your preferred programming language.

This is a curated list of Data Science cheat sheets. These resources will make your work easier and help you become a better Data Scientist. Read on for quick references covering Machine Learning, SQL, linear algebra, Excel, and more.

Machine Learning

Machine Learning is changing our society, and Data Scientists are propelling that transformation. It powers automated systems, Facebook's algorithms, and search engine results. However, a significant amount of programming goes into building the Machine Learning models that customers interact with daily. It all starts with massive datasets and a lot of creative code.

The following Machine Learning algorithms cheat sheet will be invaluable for Data Scientists who specialize in Machine Learning and for analysts preparing to enter this booming domain.

 

Supervised Learning Algorithm Cheat sheet

 

Supervised Learning

Supervised learning algorithms learn patterns from labeled historical data by mapping inputs to outputs, then use those patterns to make predictions on unseen data. Supervised learning models can be either regression models, which predict a continuous variable, or classification models, which predict a binary or multi-class variable.

Here we cover two types of supervised learning models:

  • Linear models
  • Tree-based models

 

Linear models

 

The output of a linear model is a linear combination of its input features. In this part, we will discuss the most commonly used linear models in machine learning:

 

Algorithm | Description | Applications
Linear Regression | Models a linear relationship between inputs and a numeric output variable. | Stock price forecasting; housing price forecasting; customer lifetime value prediction
Logistic Regression | Models a linear relationship between inputs and a categorical output (1 or 0). | Credit risk score prediction; customer churn prediction
Ridge Regression | A member of the regression family that penalizes features with low predictive power by shrinking their coefficients toward zero. It can be used for classification and regression. | Automobile predictive maintenance; sales revenue forecasting
Lasso Regression | A member of the regression family that penalizes features with low predictive power by shrinking their coefficients all the way to zero. It can be used for classification and regression. | Housing price forecasting; clinical outcome prediction using health data
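
To make the difference between these penalties concrete, here is a minimal sketch for a single input feature, using only the Python standard library. The function names, the toy data, and the penalty values are assumptions made for illustration; real projects would normally use a library implementation. The fit minimizes the sum of squared errors plus `l2_penalty * slope**2` plus `l1_penalty * abs(slope)`.

```python
def soft_threshold(value, amount):
    """Move value toward zero by amount, stopping at exactly zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_line(xs, ys, l2_penalty=0.0, l1_penalty=0.0):
    """Fit y = slope * x + intercept on one feature.

    l2_penalty > 0 gives a ridge-style fit, l1_penalty > 0 a lasso-style fit.
    Only the slope is penalized; the intercept is not.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = soft_threshold(sxy, l1_penalty / 2) / (sxx + l2_penalty)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that roughly follows y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

plain_slope, plain_intercept = fit_line(xs, ys)      # ordinary least squares
ridge_slope, _ = fit_line(xs, ys, l2_penalty=10.0)   # slope shrinks toward zero
lasso_slope, _ = fit_line(xs, ys, l1_penalty=50.0)   # slope becomes exactly zero
```

With these numbers the ridge slope is half the unpenalized slope, while the lasso penalty is large enough to remove the feature entirely.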

 

Tree-based models

Tree-based models use a set of “if-then” rules, organized as decision trees, to make predictions. In this part, we will go through some of the most commonly used tree-based models in machine learning.

 

Algorithm | Description | Applications
Decision Tree | Decision Tree models apply decision rules to features to make predictions. It can be used for classification and regression. | Customer churn prediction; disease prediction; credit score modeling
Random Forests | An ensemble learning method that combines the output of several decision trees. | Credit score modeling; housing price forecasting
Gradient Boosting Regression | Uses boosting to build a predictive model from an ensemble of weak learners. | Car emission forecasting; ride-hailing fare estimation
XGBoost | An efficient and flexible implementation of gradient boosting. It can be used for both classification and regression problems. | Churn prediction; insurance claims processing
LightGBM Regressor | A gradient boosting framework designed to be more efficient than existing approaches. | Flight time prediction for airlines; predicting cholesterol levels from health data
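
The “if-then” idea can be sketched with a decision stump, which is a decision tree with a single split. The example below is an illustration only: the function name `fit_stump` and the churn data are hypothetical, and it searches one feature for the threshold that misclassifies the fewest rows. Full decision trees repeat this kind of split recursively, and ensembles such as random forests combine many trees.

```python
def fit_stump(values, labels):
    """Find the threshold on one feature that best separates 0/1 labels.

    Returns (threshold, errors) for the rule:
    if value > threshold then predict 1, else predict 0.
    """
    best_threshold, best_errors = None, len(labels) + 1
    for threshold in sorted(set(values)):
        errors = sum(
            (value > threshold) != bool(label)
            for value, label in zip(values, labels)
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


# Hypothetical data: months since last purchase, and whether the customer churned.
months_inactive = [1, 2, 3, 6, 8, 9]
churned = [0, 0, 0, 1, 1, 1]

threshold, errors = fit_stump(months_inactive, churned)


def predict(months):
    """Apply the learned if-then rule to a new customer."""
    return 1 if months > threshold else 0
```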

 

Unsupervised Learning Algorithm cheat sheet

Unsupervised learning is concerned with identifying broad patterns in unlabeled data. The resulting groupings are general-purpose and can be applied to a wide range of objects. Clustering methods learn to group similar data points together, and association algorithms find items that tend to occur together based on predefined criteria.

 

Clustering models

 

Algorithm | Description | Applications
K-Means | The most widely used approach; it derives K clusters based on Euclidean distances. | Recommendation systems; customer segmentation
Hierarchical Clustering | A bottom-up approach in which each data point starts as its own cluster, and the two nearest clusters are repeatedly merged. | Fraud detection; similarity-based document clustering
Gaussian Mixture Models | A probabilistic approach for modeling normally distributed clusters in a dataset. | Recommendation systems; customer segmentation
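
As a rough illustration of how K-Means derives clusters from Euclidean distances, here is a small sketch in plain Python. The points and the starting centroids are made up, and the starting centroids are fixed by hand so the result is reproducible; library implementations usually pick them randomly and add a convergence check.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up 2D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
```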

 

Association

 

Algorithm | Description | Applications
Apriori Algorithm | A rule-based technique that finds the frequent itemsets in a given dataset by using prior knowledge of frequent itemset properties. | Recommendation engines; promotion optimization
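
The "prior knowledge" Apriori relies on is that an itemset can only be frequent if all of its smaller subsets are frequent. The sketch below is a simplified version written for illustration, with a hypothetical function name and made-up shopping baskets. It counts support level by level and prunes candidates using that property.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset that appears
    in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Build larger candidates only from items that survived this level,
        # and keep a candidate only if all its subsets of the current size are frequent.
        size += 1
        survivors = sorted({item for c in level for item in c})
        candidates = [
            frozenset(combo)
            for combo in combinations(survivors, size)
            if all(frozenset(sub) in level for sub in combinations(combo, size - 1))
        ]
    return frequent


# Made-up shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(baskets, min_support=2)
```

Here the pair butter and milk appears in only one basket, so it is dropped, and the three-item set is never counted at all because one of its pairs is infrequent.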

SQL

Data Scientists worldwide use SQL to organize data into tables and work with different datasets. SQL is often used to extract the data needed for a specific study, after which Python and its many specialized modules handle the more demanding analysis.

As a Data Scientist, you will utilize the following SQL commands and functions:

Basic SQL cheat Sheet

Important keywords

 

Keyword | Description
SELECT | States which columns to query.
FROM | Declares which table or view to select from.
WHERE | Filters rows by a condition.
= | Compares a value to a given input.
LIKE | Used with the WHERE clause to match a specific pattern in a column.
GROUP BY | Groups rows that share the same values.
HAVING | Returns only the groups whose aggregate values match the specified conditions.
INNER JOIN | Returns all rows where a record in one table matches a record in the other table.
LEFT JOIN | Returns all rows from the left table, with matching rows from the right table.
RIGHT JOIN | Returns all rows from the right table, with matching rows from the left table.
FULL OUTER JOIN | Returns all rows from both tables, matched where possible.
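
To see how the join keywords differ in practice, here is a small sketch that runs SQL through Python's built-in sqlite3 module. The `grades` table, its columns, and all of the rows are made up for illustration. RIGHT JOIN and FULL OUTER JOIN are left out because older SQLite builds do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE class (student TEXT, subject TEXT);
    CREATE TABLE grades (student TEXT, grade INTEGER);
    INSERT INTO class VALUES ('Alex', 'Math'), ('Sam', 'Art'), ('Kim', 'Physics');
    INSERT INTO grades VALUES ('Alex', 90), ('Sam', 75);
""")

# INNER JOIN keeps only students that appear in both tables.
inner_rows = conn.execute("""
    SELECT class.student, grades.grade
    FROM class
    INNER JOIN grades ON class.student = grades.student
    ORDER BY class.student
""").fetchall()

# LEFT JOIN keeps every student in class; a missing grade comes back as None.
left_rows = conn.execute("""
    SELECT class.student, grades.grade
    FROM class
    LEFT JOIN grades ON class.student = grades.student
    ORDER BY class.student
""").fetchall()

conn.close()
```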

 

Aggregate functions

Function | Description
COUNT | Returns the number of rows.
SUM | Adds up the values.
AVG | Returns the average of the values.
MIN | Returns the smallest value in the group.
MAX | Returns the largest value in the group.
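
As a quick check of what each aggregate returns, the following sketch computes all five over one column using Python's built-in sqlite3 module. The `score` column and its values are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class (student TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO class VALUES (?, ?)",
    [("Alex", 90), ("Sam", 70), ("Kim", 80)],
)

# One result row holding all five aggregates over the whole table.
count, total, average, lowest, highest = conn.execute(
    "SELECT COUNT(*), SUM(score), AVG(score), MIN(score), MAX(score) FROM class"
).fetchone()

conn.close()
```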

Querying data

SQL | Description
SELECT student FROM class | Select the data in column student from a table named class.
SELECT * FROM class | Select all rows and columns from the table class.
SELECT student FROM class WHERE student = 'Alex' | Select the data in column student from the table class where student = 'Alex'.
SELECT student FROM class ORDER BY student ASC (or DESC) | Select the data in column student from the table class and sort it by student (ascending by default, or descending with DESC).
SELECT student FROM class ORDER BY student LIMIT n OFFSET offset | Select the data in column student from the table class, skip the first offset rows, and return the next n rows.
SELECT student, aggregate(subject) FROM class GROUP BY student | Group the rows of the table class by student and apply an aggregate function to each group.
SELECT student, aggregate(subject) FROM class GROUP BY student HAVING condition | Group the rows of the table class by student, apply an aggregate function to each group, and keep only the groups that satisfy the HAVING condition.
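
The query patterns above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the `score` column and every row are made up so that the filters and groups have something to work on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class (student TEXT, subject TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO class VALUES (?, ?, ?)",
    [
        ("Alex", "Math", 90),
        ("Alex", "Art", 70),
        ("Sam", "Math", 60),
        ("Kim", "Art", 85),
        ("Kim", "Math", 95),
    ],
)

# WHERE: only Alex's rows.
alex_subjects = conn.execute(
    "SELECT subject FROM class WHERE student = 'Alex' ORDER BY subject"
).fetchall()

# ORDER BY with LIMIT and OFFSET: skip the top score, return the next two.
runners_up = conn.execute(
    "SELECT student, score FROM class ORDER BY score DESC LIMIT 2 OFFSET 1"
).fetchall()

# GROUP BY with HAVING: average score per student, keeping averages of 80 or more.
strong_students = conn.execute(
    "SELECT student, AVG(score) FROM class "
    "GROUP BY student HAVING AVG(score) >= 80 ORDER BY student"
).fetchall()

conn.close()
```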

Data modification

SQL | Description
INSERT INTO class(column_list) VALUES (value_list) | Insert one row into the table class.
INSERT INTO class(column_list) VALUES (value_list), (value_list), … | Insert multiple rows into the table class.
INSERT INTO class(column_list) SELECT column_list FROM subject | Insert rows from the table subject into the table class.
UPDATE class SET student = new_value | Set the column student to a new value for all rows in the table class.
UPDATE class SET student = new_value, father_name = new_value WHERE condition | Update the columns student and father_name in the table class for the rows that meet the condition.
DELETE FROM class | Delete all rows from the table class.
DELETE FROM class WHERE condition | Delete the rows from the table class that meet a certain condition.
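
Here is a short sketch of the same INSERT, UPDATE, and DELETE patterns, run through Python's built-in sqlite3 module. All names and values in the rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class (student TEXT, father_name TEXT)")

# Insert a single row, then several rows in one statement.
conn.execute("INSERT INTO class (student, father_name) VALUES ('Alex', 'John')")
conn.execute(
    "INSERT INTO class (student, father_name) VALUES ('Sam', 'Peter'), ('Kim', 'Lee')"
)

# Update only the rows that meet the condition.
conn.execute("UPDATE class SET father_name = 'Jon' WHERE student = 'Alex'")

# Delete only the rows that meet the condition.
conn.execute("DELETE FROM class WHERE student = 'Sam'")

remaining = conn.execute(
    "SELECT student, father_name FROM class ORDER BY student"
).fetchall()

conn.close()
```

Leaving out the WHERE clause on UPDATE or DELETE would affect every row in the table, so it is worth double-checking the condition before running either statement.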

Math

Data Science is a demanding discipline that requires solid mathematics. Depending on your field of study, you may need to use calculus, linear algebra, and statistics regularly. To progress in the discipline, Data Scientists must thoroughly understand these ideas and how they apply in various contexts.

Math cheat sheets are tools that let Data Science students and experts quickly find a certain equation or double-check their work.

Even for experienced Data Scientists, many of these equations can get hazy if not used daily. This is your quick-reference linear algebra cheat sheet, containing basic terminology that Data Scientists might need.

Cheat Sheet for Linear Algebra

Notation

Term | Notation
vector | denoted by a lowercase letter v with an arrow above
scalar | any real number, e.g. 2, 1, ⅓ or π
matrix | A, a capital letter representing an m × n matrix
m × n | m rows by n columns
basis vectors | the letters i, j and k, each with a hat (^) above
mapping | T: R^m → R^n, a transformation from m dimensions to n dimensions
determinant | scalar, the signed area or volume spanned by the vectors that make up a matrix
cross product | vector perpendicular to the plane of two vectors in three dimensions; its length is the area of the parallelogram they span
dot product | scalar, a measure of how closely two vectors point in the same direction
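
The last three terms are easier to remember with numbers. The sketch below implements them in plain Python for small vectors; the function names and example vectors are chosen for illustration, and numerical libraries provide equivalents for real work.

```python
def dot(u, v):
    """Dot product: a scalar, zero when the vectors are perpendicular."""
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    """Cross product of two 3D vectors: a vector perpendicular to both."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def det2(matrix):
    """Determinant of a 2 x 2 matrix: the signed area spanned by its rows."""
    (a, b), (c, d) = matrix
    return a * d - b * c


u = (3, 0, 0)
v = (0, 2, 0)

perpendicular = cross(u, v)        # points along the third axis
alignment = dot(u, v)              # u and v are at right angles
area = det2([[3, 0], [0, 2]])      # same two vectors, viewed in the plane
```

In this example the length of the cross product and the 2 x 2 determinant agree: both give the area of the 3 by 2 rectangle spanned by u and v.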

Data Science Resources

If you’re just starting your career in Data Science or are still studying to become a Data Scientist, you need to brush up on essential terminology and Excel functions. This cheat sheet provides important shortcuts, commands, and paste-able formulas that will save you time.

Excel cheat sheet

Function | Shortcut
Add Current Date | Ctrl+;
Add Current Time | Ctrl+Shift+;
Edit Cell Comment | Shift+F2
Show Active Cell | Ctrl+Backspace
Add Column | Alt+I, C
Add Row | Alt+I, R
Fill Down | Ctrl+D
Fill Right | Ctrl+R
Save Workbook | Alt+Shift+F2
Add Chart | Alt+F1
Move to Last Cell | Ctrl+End

Excel cell reference cheat sheet

Formulas require cell references. How you define a cell reference affects how the formula is applied and how it behaves when copied from one cell to another.

Relative Cell Reference | =A2+B2
Absolute Cell Reference | =$A$1

Excel date and time cheat sheet

Function | Syntax | Description
DATE | DATE(year, month, day) | Returns a date given the year, month, and day.
DATEDIF | DATEDIF(start_date, end_date, unit) | Calculates the time between two given dates.
DAY | DAY(serial_number) | Returns the day of the month of a date (an integer from 1 to 31).
EDATE | EDATE(start_date, months) | Adds a number of months to a start date.
EOMONTH | EOMONTH(start_date, months) | Like EDATE, but returns the last day of the resulting month.
NOW | NOW() | Returns the serial number of the current date and time.
TODAY | TODAY() | Returns the serial number of the current date.
YEAR | YEAR(serial_number) | Returns the year of a date.
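
If you move between Excel and Python, it can help to see what EDATE and EOMONTH compute. The sketch below is an approximation written with Python's standard library, not Excel's own implementation; the function names simply mirror the Excel ones, and the dates are arbitrary examples.

```python
import calendar
from datetime import date


def edate(start, months):
    """Shift a date by a number of months, clamping the day
    to the length of the target month."""
    month_index = start.year * 12 + (start.month - 1) + months
    year, month = divmod(month_index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def eomonth(start, months):
    """Return the last day of the month that is `months` away from start."""
    shifted = edate(start.replace(day=1), months)
    last_day = calendar.monthrange(shifted.year, shifted.month)[1]
    return shifted.replace(day=last_day)


one_month_later = edate(date(2024, 1, 31), 1)      # clamped to the end of February
end_of_next_month = eomonth(date(2024, 1, 15), 1)
days_between = (date(2024, 3, 1) - date(2024, 1, 1)).days  # like DATEDIF with unit "D"
```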

 

Conclusion

The cheat sheets recommended in this article are a narrowed-down list of the best. They will keep you covered in your projects and help you brush up on your skills.

It’s critical to keep up with innovations in this fast-changing digital industry, no matter where you are on your Data Science journey. Every aspect of the profession changes and progresses with time. Data analysis programming languages, tools, and procedures keep improving and becoming more robust. That constant evolution is one of the things that makes this profession so appealing.

Learning is a never-ending process, so keep learning and advancing professionally. Enroll in the latest online programs and webinars on big data, deep learning, Machine Learning, or Artificial Intelligence if you want to dive further into a specific field of Data Science.
