Data Science Cheat Sheet For Beginners
Introduction
If you are a Data Scientist, you are well aware of the many SQL statements, Excel formulas, functions, and algorithms your profession relies on. You have no doubt mastered the ones you use often, but sometimes a project demands unfamiliar applications or new tools in your preferred programming language.
This is a curated list of Data Science cheat sheets. These resources will make your work easier and help you become a better Data Scientist. Read on for quick references covering Machine Learning, SQL, linear algebra, and Excel.
Machine Learning
Machine Learning is changing our society, and Data Scientists are propelling that transformation. Machine Learning is used in automated systems, Facebook algorithms, and search engine results. However, a significant amount of programming goes into building the Machine Learning models that people interact with daily. It all starts with massive datasets and a lot of creative code.
This Machine Learning algorithms cheat sheet will be useful for Data Scientists who specialize in Machine Learning and for analysts who are preparing to enter this booming domain.
Supervised Learning Algorithm Cheat sheet
Supervised Learning
Supervised learning algorithms learn a mapping from inputs to outputs on labeled historical data, then use it to make predictions on unseen data. Supervised learning models can be either regression models, which predict a continuous variable, or classification models, which predict a binary or multi-class variable.
Here we look at two types of supervised learning models:
- Linear models
- Tree-based models
Linear models
The output of a linear model is a linear combination of the input features. In this part, we will discuss the most commonly used linear models in machine learning:
Algorithm | Description | Applications |
Linear Regression | An approach for modeling a linear relationship between inputs and a numeric output variable. |
|
Logistic Regression | An algorithm that models a linear relationship between inputs and the log-odds of a categorical output (1 or 0). |
|
Ridge Regression | A member of the regression family that penalizes features with low predictive power by shrinking their coefficients toward zero. It can be used for classification and regression. |
|
Lasso Regression | A member of the regression family that penalizes features with low predictive power by shrinking their coefficients all the way to zero. It can be used for classification and regression. |
|
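As a rough illustration of the difference between plain linear regression and ridge regression, here is a minimal one-feature sketch in pure Python. The function name `fit_line`, the `penalty` parameter, and the toy data are assumptions made for this example; real projects would normally use a library implementation.

```python
def fit_line(xs, ys, penalty=0.0):
    """Fit y = intercept + slope * x by least squares.

    penalty=0.0 gives ordinary linear regression. penalty>0 gives a
    one-feature ridge regression (the intercept is not penalized).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    # The penalty enlarges the denominator, pulling the slope toward zero.
    slope = s_xy / (s_xx + penalty)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical toy data: scores are exactly 50 + 2 * hours.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

ols_intercept, ols_slope = fit_line(hours, scores)
ridge_intercept, ridge_slope = fit_line(hours, scores, penalty=10.0)

print(ols_intercept, ols_slope)      # 50.0 2.0
print(ridge_intercept, ridge_slope)  # 53.0 1.0
```

With the penalty applied, the slope shrinks from 2.0 to 1.0, which is the "coefficients toward zero" behavior described in the table.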
Tree-based models
Tree-based models use a series of “if-then” rules, arranged as decision trees, to generate predictions. In this part, we will go through some of the most commonly used tree-based models in machine learning.
Algorithm | Description | Applications |
Decision Tree | Decision Tree models make predictions by applying decision rules to features. They can be used for classification and regression. |
|
Random Forests | A form of ensemble learning that combines the output of several decision trees. |
|
Gradient Boosting Regression | Gradient Boosting Regression uses boosting to build a predictive model from a group of weak learners. |
|
XGBoost | An efficient and flexible implementation of gradient boosting. It can be used for both classification and regression problems. |
|
LightGBM Regressor | A gradient boosting framework that is designed to be more efficient than earlier approaches. |
|
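To make the “if-then” idea concrete, here is a minimal sketch of a decision stump, a decision tree with a single split, in pure Python. It assumes binary labels (0 or 1), one numeric feature, and at least two distinct feature values; the names `fit_stump` and `predict` and the toy data are hypothetical.

```python
def fit_stump(values, labels):
    """Find the threshold on one feature that misclassifies the fewest points.

    Returns (threshold, left_label, right_label) for the rule:
    if value <= threshold then left_label else right_label.
    """
    best = None
    sorted_values = sorted(set(values))
    # Candidate thresholds are midpoints between adjacent distinct values.
    for low, high in zip(sorted_values, sorted_values[1:]):
        threshold = (low + high) / 2
        for left_label, right_label in ((0, 1), (1, 0)):
            errors = sum(
                (left_label if v <= threshold else right_label) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label, right_label)
    return best[1:]


def predict(rule, value):
    """Apply the learned if-then rule to a single value."""
    threshold, left_label, right_label = rule
    return left_label if value <= threshold else right_label


# Hypothetical toy data: customer age and whether they bought (1) or not (0).
ages = [18, 22, 25, 40, 47, 52]
bought = [0, 0, 0, 1, 1, 1]

rule = fit_stump(ages, bought)
print(rule)               # (32.5, 0, 1)
print(predict(rule, 30))  # 0
print(predict(rule, 45))  # 1
```

A full decision tree repeats this split search recursively on each side of the threshold, and ensembles such as Random Forests and gradient boosting combine many such trees.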
Unsupervised Learning Algorithm cheat sheet
Unsupervised learning is concerned with identifying broad patterns in unlabeled data. This kind of pattern discovery generalizes to a wide range of objects. Clustering methods learn to group similar data points together, while association algorithms find rules that relate items which frequently occur together.
Clustering models
Algorithm | Description | Applications |
K-Means | The most widely used approach. It derives K clusters based on Euclidean distances. |
|
Hierarchical Clustering | A bottom-up methodology in which each data point starts as its own cluster, and the two nearest clusters are repeatedly merged. |
|
Gaussian Mixture Models | A probabilistic approach for representing normally distributed clusters in a dataset. |
|
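The K-Means idea from the table can be sketched in a few lines of pure Python. This is a simplified version of the standard iterative procedure that assumes the first k points are used as starting centroids (real implementations usually pick them more carefully); the function name `k_means` and the toy points are hypothetical.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters by Euclidean distance.

    Uses the first k points as the initial centroids (an assumption
    made to keep this sketch deterministic).
    """
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)

print(clusters[0])  # [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
print(clusters[1])  # [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
```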
Association
Algorithm | Description | Applications |
Apriori Algorithm | A rule-based technique that finds the frequent itemsets in a given dataset, using the prior knowledge that every subset of a frequent itemset must also be frequent. |
|
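Here is a compact sketch of the Apriori idea in pure Python. It only finds frequent itemsets (it does not go on to derive association rules), and the function name `apriori`, the `min_support` count threshold, and the shopping baskets are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: keep a candidate only if all its smaller subsets are frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
frequent = apriori(baskets, min_support=2)

print(frequent[frozenset({"bread", "milk"})])    # 2
print(frequent[frozenset({"bread", "butter"})])  # 2
print(frozenset({"milk", "butter"}) in frequent)  # False
```

The prune step is where the "prior knowledge" is used: the three-item set {bread, butter, milk} is never counted, because one of its subsets, {butter, milk}, is not frequent.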
SQL
Data Scientists worldwide use SQL to organize data into tables and work with different datasets. SQL is often used to extract the data needed for a specific study; Python and its many specialized modules then handle the more challenging parts of the project.
As a Data Scientist, you will use the following SQL commands and functions:
Basic SQL cheat Sheet
Important keywords
Keyword | Description |
SELECT | States which columns to query. |
FROM | Declares which table/view to select from. |
WHERE | Filters rows by a condition. |
= | Compares a value to a given input. |
LIKE | Used with the WHERE clause to match a specific pattern in a column. |
GROUP BY | Groups rows that share the same values. |
HAVING | Specifies that only groups whose aggregate values match the given conditions should be returned. |
INNER JOIN | Returns all rows where a record in one table matches a record in the other table. |
LEFT JOIN | Returns all rows from the left table, with the matching rows from the right table. |
RIGHT JOIN | Returns all rows from the right table, with the matching rows from the left table. |
FULL OUTER JOIN | Returns all rows from both tables, pairing the rows that match. |
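The difference between INNER JOIN and LEFT JOIN is easiest to see on a tiny dataset. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table layout (`subject_id`, `id`, `name`) and the rows are assumptions invented for this example.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE class (student TEXT, subject_id INTEGER);
    CREATE TABLE subject (id INTEGER, name TEXT);
    INSERT INTO class VALUES ('Alex', 1), ('Bea', 2), ('Cy', NULL);
    INSERT INTO subject VALUES (1, 'Math'), (2, 'Art'), (3, 'Music');
""")

# INNER JOIN keeps only students whose subject_id matches a subject row.
inner_rows = conn.execute("""
    SELECT class.student, subject.name
    FROM class INNER JOIN subject ON class.subject_id = subject.id
    ORDER BY class.student
""").fetchall()

# LEFT JOIN keeps every student; unmatched ones get NULL (None in Python).
left_rows = conn.execute("""
    SELECT class.student, subject.name
    FROM class LEFT JOIN subject ON class.subject_id = subject.id
    ORDER BY class.student
""").fetchall()

print(inner_rows)  # [('Alex', 'Math'), ('Bea', 'Art')]
print(left_rows)   # [('Alex', 'Math'), ('Bea', 'Art'), ('Cy', None)]
```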
Aggregate functions
Function | Description |
COUNT | Gives the number of rows |
SUM | Adds the values |
AVG | Gives the average of the values |
MIN | Gives the smallest value of the group |
MAX | Gives the largest value of the group |
Querying data
SQL | Description |
SELECT student FROM class | Select the data in column student from a table named class |
SELECT * FROM class | Select all rows and columns from the table class |
SELECT student FROM class WHERE student = 'Alex' | Select the data in column student from the table class where student equals 'Alex' |
SELECT student FROM class ORDER BY student ASC | Select the data in column student from the table class, sorted by student in ascending order (the default); use DESC for descending order |
SELECT student FROM class ORDER BY student LIMIT n OFFSET offset | Select the data in column student from the table class, skip the first offset rows, and return the next n rows |
SELECT student, aggregate(subject) FROM class GROUP BY student | Select the data in column student from the table class, group the rows by student, and apply an aggregate function to each group |
SELECT student, aggregate(subject) FROM class GROUP BY student HAVING condition | Select the data in column student from the table class, group the rows by student with an aggregate function, and keep only the groups that satisfy the HAVING condition |
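The query patterns above can be tried out with Python's built-in `sqlite3` module. This is a small sketch on an in-memory database; the `score` column and the sample rows are assumptions added so that the aggregate functions have something to work on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class (student TEXT, subject TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO class VALUES (?, ?, ?)",
    [
        ("Alex", "Math", 90),
        ("Alex", "Art", 70),
        ("Bea", "Math", 60),
        ("Cy", "Math", 85),
        ("Cy", "Art", 95),
    ],
)

# WHERE with ORDER BY: only Alex's rows, sorted by subject.
alex_rows = conn.execute(
    "SELECT student, subject FROM class WHERE student = ? ORDER BY subject",
    ("Alex",),
).fetchall()

# LIMIT and OFFSET: skip the first student name, return the next two.
page = conn.execute(
    "SELECT DISTINCT student FROM class ORDER BY student LIMIT 2 OFFSET 1"
).fetchall()

# GROUP BY with an aggregate: the average score per student.
averages = conn.execute(
    "SELECT student, AVG(score) FROM class GROUP BY student ORDER BY student"
).fetchall()

# HAVING: keep only the groups whose average is above 75.
top = conn.execute(
    "SELECT student, AVG(score) FROM class "
    "GROUP BY student HAVING AVG(score) > 75 ORDER BY student"
).fetchall()

print(alex_rows)  # [('Alex', 'Art'), ('Alex', 'Math')]
print(page)       # [('Bea',), ('Cy',)]
print(averages)   # [('Alex', 80.0), ('Bea', 60.0), ('Cy', 90.0)]
print(top)        # [('Alex', 80.0), ('Cy', 90.0)]
```

Note that WHERE filters individual rows before grouping, while HAVING filters the groups after the aggregate has been computed.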
Data modification
SQL | Description |
INSERT INTO class(column_list) VALUES (value_list) | Insert one row into the table class |
INSERT INTO class(column_list) VALUES (value_list), (value_list), … | Insert several rows into the table class |
INSERT INTO class(column_list) SELECT column_list FROM subject | Insert rows from the table subject into the table class |
UPDATE class SET student = new_value | Set a new value in the column student for all rows of the table class |
UPDATE class SET student = new_value, father_name = new_value WHERE condition | Update the values in columns student and father_name for the rows of the table class that meet the condition |
DELETE FROM class | Delete all rows from the table class |
DELETE FROM class WHERE condition | Delete the rows from the table class that meet a certain condition |
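The modification statements can be sketched the same way with `sqlite3`. The table layout follows the `student` and `father_name` columns used above; the sample names and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class (student TEXT, father_name TEXT)")

# Insert a single row, then several rows in one statement.
conn.execute("INSERT INTO class (student, father_name) VALUES ('Alex', 'Sam')")
conn.execute(
    "INSERT INTO class (student, father_name) VALUES ('Bea', 'Raj'), ('Cy', 'Lee')"
)

# Update only the rows that meet the WHERE condition.
conn.execute("UPDATE class SET father_name = 'Samuel' WHERE student = 'Alex'")

# Delete only the rows that meet the WHERE condition.
conn.execute("DELETE FROM class WHERE student = 'Cy'")

rows = conn.execute(
    "SELECT student, father_name FROM class ORDER BY student"
).fetchall()
print(rows)  # [('Alex', 'Samuel'), ('Bea', 'Raj')]
```

Leaving out the WHERE clause in UPDATE or DELETE applies the change to every row, so it is worth double-checking the condition before running either statement.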
Math
Data Science is a demanding discipline that requires a solid grasp of mathematics. Depending on your field of study, you may need to use calculus, linear algebra, and statistics regularly. To progress in the discipline, Data Scientists must understand these ideas thoroughly and know how they apply in various contexts.
Math cheat sheets are tools that let Data Science students and experts quickly look up an equation or double-check their work.
Even for competent Data Scientists, many of these equations can get hazy if not used daily. This is your quick-reference basic linear algebra Data Science cheat sheet, containing basic terminology that Data Scientists might need.
Cheat Sheet for Linear Algebra
Notation
TERM | NOTATION |
vector | denoted by a lowercase letter v with an arrow above it |
scalar | any real number, e.g. 2, 1, ⅓ or π |
matrix | A, denoted by a capital letter; an m × n matrix |
m × n | m rows by n columns |
basis vectors | denoted by the letters i, j and k, each with a hat (^) above |
mapping | T: Rm → Rn, a transformation from m-dimensional space to n-dimensional space |
determinant | a scalar; the factor by which a matrix scales area or volume |
cross product | a vector perpendicular to the plane of two vectors in three dimensions; its length equals the area of the parallelogram they span |
dot product | a scalar; measures how much one vector points in the direction of another |
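The last three entries can be checked numerically with a few lines of pure Python. The helper names `dot`, `cross`, and `det2` are hypothetical, and `det2` only handles the 2 × 2 case.

```python
def dot(u, v):
    """Dot product: a scalar, the sum of the element-wise products."""
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    """Cross product of two 3-D vectors: a vector perpendicular to both."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def det2(m):
    """Determinant of a 2 x 2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    return a * d - b * c


# The standard basis vectors i-hat and j-hat in three dimensions.
i_hat = (1, 0, 0)
j_hat = (0, 1, 0)

print(dot(i_hat, j_hat))       # 0, because the vectors are perpendicular
print(cross(i_hat, j_hat))     # (0, 0, 1), which is k-hat
print(det2(((2, 0), (0, 3))))  # 6, the matrix scales areas by a factor of 6
```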
Data Science Resources
If you’re just starting your career in Data Science or are still studying to become a Data Scientist, you need to brush up on essential terminology and Excel functions. This cheat sheet provides important shortcuts, commands, and paste-able formulas that will save you time.
Excel cheat sheet
Function | Shortcut |
Add Current Date | ctrl+; |
Add Current Time | shift+ctrl+; |
Edit Cell Comment | shift+F2 |
Show Active Cell | ctrl+backspace |
Add Column | alt+I, C |
Add Row | alt+I, R |
Fill Down | ctrl+D |
Fill Right | ctrl+R |
Save Workbook | shift+alt+F2 |
Add Chart | Alt+F1 |
Move to Last | ctrl+END |
Excel cell reference cheat sheet
Formulas require a cell reference. How you define the cell reference affects how the formula is applied and how it behaves when copied from one cell to another.
Relative Cell Reference | =A2+B2 |
Absolute Cell Reference | =$A$1 |
Excel date and time cheat sheet
Function | Syntax | Description |
DATE | DATE(year, month, day) | returns a date given the year, month, and day. |
DATEDIF | DATEDIF(start_date, end_date, unit) | calculates the number of days, months, or years between two given dates. |
DAY | DAY(serial_number) | returns the day of the month of a date (an integer from 1 to 31). |
EDATE | EDATE(start_date, months) | returns the date a given number of months after a start date. |
EOMONTH | EOMONTH(start_date, months) | shifts the date like EDATE, but returns the last day of the resulting month. |
NOW | NOW() | returns the serial number of the current date and time. |
TODAY | TODAY() | returns the serial number of the current date. |
YEAR | YEAR(serial_number) | returns the year of a date. |
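For readers who move between Excel and Python, the month-shifting functions above can be approximated with the standard `datetime` and `calendar` modules. This is a sketch, not a full reimplementation of Excel's behavior; the function names `edate` and `eomonth` are chosen here to mirror the Excel names.

```python
import calendar
from datetime import date


def edate(start, months):
    """Shift a date by whole months, clamping the day to the month's length."""
    month_index = start.year * 12 + (start.month - 1) + months
    year, month = divmod(month_index, 12)
    month += 1  # divmod gives 0-11, calendar months are 1-12
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def eomonth(start, months):
    """Shift by whole months, then return the last day of that month."""
    shifted = edate(start.replace(day=1), months)
    last_day = calendar.monthrange(shifted.year, shifted.month)[1]
    return shifted.replace(day=last_day)


print(edate(date(2024, 1, 31), 1))    # 2024-02-29
print(eomonth(date(2024, 1, 15), 1))  # 2024-02-29
print(edate(date(2024, 11, 30), 3))   # 2025-02-28

# Similar to DATEDIF with the day unit: whole days between two dates.
print((date(2024, 3, 10) - date(2024, 1, 1)).days)  # 69
```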
Conclusion
The cheat sheets recommended in this article are a narrowed-down list of the best. They will support you in your projects and help you brush up on your skills.
It’s critical to keep up with innovations in this fast-changing digital industry, no matter where you are on your Data Science journey. Every aspect of the profession changes and progresses over time. Data analysis programming languages, tools, and procedures keep improving and becoming more robust. That constant progress is one of the things that makes this profession so appealing.
Learning is a never-ending process, so keep learning and advancing professionally. If you want to dive deeper into a specific field of Data Science, enroll in up-to-date online programs and webinars on big data, deep learning, Machine Learning, or Artificial Intelligence.