Exploratory data analysis (EDA) is a fundamental step in any data science process. It reveals hidden characteristics of the data and can change how you see it. A common mistake among data science enthusiasts who are starting to build machine learning models is to jump into feature engineering without proper exploratory data analysis. In this tutorial, you will discover why EDA is crucial and the key steps to take when doing it.
Let’s jump in.
What is Exploratory Data Analysis?
Exploratory data analysis is the process of using statistical techniques and visualizations to transform the data, find hidden features in it, and discover trends. The transformations made during EDA can be essential for later analysis.
Exploratory Data Analysis can bring to light the importance of various features in the dataset and can, in fact, expose anomalies in the data. Let’s talk about the reasons you’d need exploratory data analysis specifically.
Why Exploratory Data Analysis
The importance of EDA cannot be overemphasized. It serves two major purposes:
- EDA helps clean the data: Data often comes with duplicates, errors, missing values, and so on. EDA exposes these problems so the data can be cleaned and sanitized into a form that suits the machine learning model.
- EDA helps you better understand the variables and their relationships with one another: You can identify features that are highly correlated, features that are weakly correlated, and the features that matter most.
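As a rough illustration of both purposes, here is a minimal sketch using pandas. The dataset and its column names (age, income, purchases) are made up for this example; the idea is simply to count duplicate rows and then look at how the numeric variables correlate.

```python
import pandas as pd

# Hypothetical toy dataset; the column names and values are made up.
df = pd.DataFrame({
    "age": [22, 35, 35, 47, 51, 29],
    "income": [28000, 52000, 52000, 71000, 80000, 40000],
    "purchases": [9, 4, 4, 6, 2, 7],
})

# Purpose 1: spot data problems, here exact duplicate rows.
n_duplicates = int(df.duplicated().sum())
deduped = df.drop_duplicates()

# Purpose 2: see how the variables relate to one another.
# corr() computes pairwise Pearson correlations between numeric columns.
corr = deduped.corr()

print("duplicate rows:", n_duplicates)
print(corr.round(2))
```

Correlation values close to 1 or -1 suggest strongly related features, while values near 0 suggest a weak linear relationship. Whether a highly correlated feature should be kept or dropped depends on the model you plan to build.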
How to do Exploratory Data Analysis
- Check the size of the data: You can begin to understand the data by finding out the number of samples and features it contains. This gives you an idea of the size of the data you are dealing with. With this in mind, you will be able to determine the best methods to use when building your model.
- Check the nature of the features: Features in a dataset can be either continuous or categorical. They can also be stored as numbers or as strings.
- Check for missing values: Missing values can greatly affect the output of your model. You can replace missing values with the mean or median of the variable. If only a few values are missing, the affected rows can be dropped instead.
- Check for outliers: Outliers are extremely high or low entries in the data. The presence of outliers can tip the balance of the data, making results inaccurate. In EDA, you should check for outliers and deal with them accordingly.
- Create plots: Creating plots and visualizations is a critical part of exploratory data analysis. Let’s discuss some of the important plots you can make during EDA.
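The non-plotting checks above can be sketched with pandas as follows. The dataset, its column names, and the choice of the 1.5 × IQR rule for flagging outliers are assumptions made for this example; the tutorial does not prescribe a particular outlier method, and other rules may suit your data better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names and values are made up.
df = pd.DataFrame({
    "age": [23, 31, np.nan, 45, 38, 29, 120],
    "city": ["Lagos", "Accra", "Lagos", None, "Nairobi", "Accra", "Lagos"],
    "income": [30000, 42000, 39000, 58000, np.nan, 36000, 41000],
})

# Size of the data: (number of samples, number of features).
n_rows, n_cols = df.shape

# Nature of the features: numeric columns versus everything else.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

# Missing values: count them per column.
missing = df.isna().sum()

# One option for numeric gaps: fill them with each column's median.
filled = df.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

# Outliers: flag ages outside 1.5 * IQR of the quartiles
# (an assumed rule of thumb, not the only possible choice).
q1 = filled["age"].quantile(0.25)
q3 = filled["age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = filled[(filled["age"] < lower) | (filled["age"] > upper)]

print("shape:", (n_rows, n_cols))
print("numeric:", numeric_cols, "other:", other_cols)
print(missing)
print(outliers)
```

In this toy data the age of 120 falls outside the IQR bounds and is flagged. Whether such a value should be removed, capped, or kept is a judgment call that depends on what the data represents.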
Important Visualizations during EDA
- Bar chart: A bar chart is typically used to show and compare the frequencies of the categories in a categorical feature.
- Histogram: Histograms are used to plot the frequency distribution of values in a numerical feature. If, for instance, you want to check how the ages of your samples are spread out, you can plot a histogram of the ages.
- Line chart: A line chart is typically used to keep track of changes in a feature over time, especially in situations where the changes are small. This explains why line charts, and not bar charts, are used to plot stock market or cryptocurrency changes.
- Scatter plot: Scatter plots are typically used to visualize how two variables are related.
- Box plot: Box plots give a quick view of the dispersion of a variable in terms of its quartiles. They are also used to reveal the existence of outliers in the data.
These are some of the most important plots used in EDA.
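Here is a minimal sketch of the five plots using matplotlib. The data is randomly generated and the column names (age, income, segment, day, price) are hypothetical, chosen only so that each plot type has something suitable to show. The output file name eda_plots.png is likewise an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical synthetic data; every column here is made up.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(35, 10, n).round(),
    "income": rng.normal(50000, 12000, n),
    "segment": rng.choice(["A", "B", "C"], n),
    "day": np.arange(n),
})
df["price"] = 100 + rng.normal(0, 1, n).cumsum()  # a random walk over time

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Bar chart: frequency of each category.
counts = df["segment"].value_counts().sort_index()
axes[0, 0].bar(counts.index, counts.values)
axes[0, 0].set_title("Bar chart: segment counts")

# Histogram: frequency distribution of a numerical feature.
axes[0, 1].hist(df["age"], bins=20)
axes[0, 1].set_title("Histogram: age")

# Line chart: changes in a feature over time.
axes[0, 2].plot(df["day"], df["price"])
axes[0, 2].set_title("Line chart: price by day")

# Scatter plot: relationship between two variables.
axes[1, 0].scatter(df["age"], df["income"], s=10)
axes[1, 0].set_title("Scatter plot: age vs income")

# Box plot: quartiles and potential outliers.
axes[1, 1].boxplot(df["income"].to_numpy())
axes[1, 1].set_title("Box plot: income")

axes[1, 2].axis("off")  # unused panel
fig.tight_layout()
fig.savefig("eda_plots.png")
```

In an interactive session or notebook you would normally drop the Agg backend line and call plt.show() instead of saving to a file.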
In conclusion, you have seen what exploratory data analysis is and why it is useful in building machine learning models. You also learned about the things to check when carrying out EDA and the necessary plots to make.