Exploratory Data Analysis (EDA) is a way of examining datasets to summarize their main characteristics, generally using visual methods. It is done before modeling begins, to see what the data can tell us. It is hard to deduce the important properties of data from a column of numbers or an entire spreadsheet, and deriving insights from raw data can be tedious, dull, and overwhelming. Exploratory data analysis techniques were developed to help with exactly this.
There are two ways to categorize exploratory data analysis techniques. First, each technique is either non-graphical or graphical. Second, each is either univariate or multivariate; in practice, the multivariate case is most often bivariate.
So what should you do to conduct exploratory data analysis?
Missing data
Before you begin studying the data, make an effort to understand it at a high level. Speak with leadership and product to gather as much context as you can; it will help you decide where to concentrate your efforts. Do you want to predict something? Is the work purely for research? Depending on the desired outcome, you may emphasize quite different things in your EDA.
Now that you’ve established how you’ll use the data, you can start looking at the data itself. It’s typically a good idea to start by looking for missing values. I recommend examining features one at a time and ranking them by how much they matter to your analysis, both for this step and for later ones.
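As a minimal sketch of that first pass, assuming the data is loaded into a pandas DataFrame, you can count the missing values per feature and sort the worst offenders to the top. The column names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are invented for this example.
df = pd.DataFrame({
    "delivered_orders": [10, 12, np.nan, 15, 11],
    "fulfilled_orders": [11, np.nan, np.nan, 16, 12],
    "region": ["north", "south", None, "south", "north"],
})

# Count and percentage of missing values per feature, most affected first.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```

A table like this makes it easier to decide which features to investigate first.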
Next, for each feature, I recommend trying to figure out why the data is missing and what that might mean. Unfortunately, this isn’t always straightforward, and a clear answer may not be available. A whole branch of statistics, known as imputation, is dedicated to this problem and offers numerous solutions. The strategy you choose depends on the type of data you have. For time series data with no seasonality or trend, you can fill in missing values with the mean or median. If there is a trend but no seasonality, you can use linear interpolation. If there is both, you should adjust for seasonality before applying linear interpolation.
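Here is a rough sketch of the first two strategies using pandas, on two small invented series. It covers only the no-trend and trend-only cases; the seasonal case needs a seasonal adjustment step first and is left out here.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")

# Hypothetical series with no trend or seasonality: fill gaps with the median.
flat = pd.Series([5.0, np.nan, 6.0, 5.0, np.nan, 4.0], index=idx)
flat_filled = flat.fillna(flat.median())

# Hypothetical series with a trend but no seasonality: interpolate linearly.
# method="linear" treats the observations as equally spaced.
trend = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0], index=idx)
trend_filled = trend.interpolate(method="linear")

print(flat_filled)
print(trend_filled)
```

Filling the trending series with its mean or median instead would flatten the trend at the gaps, which is why interpolation is the better fit there.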
Shape of data
If the dataset is a time series, look at how each feature evolves over time. The feature may show seasonality or a positive or negative linear trend. Take all of these into account in your EDA.
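One possible way to check for both, sketched on a synthetic daily series: fit a least-squares line to estimate the trend, then look at the autocorrelation of the detrended series at a candidate seasonal lag. The series, the weekly period, and the lag of 7 are all assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: an upward trend plus a weekly pattern (assumed).
days = np.arange(56)
values = 100 + 0.5 * days + 5 * np.sin(2 * np.pi * days / 7)
series = pd.Series(values, index=pd.date_range("2024-01-01", periods=56, freq="D"))

# Linear trend: slope of a least-squares line through the series.
slope, intercept = np.polyfit(days, series.to_numpy(), deg=1)

# Seasonality hint: autocorrelation of the detrended series at lag 7.
detrended = series - (intercept + slope * days)
weekly_autocorr = detrended.autocorr(lag=7)

print(f"estimated slope per day: {slope:.3f}")
print(f"lag-7 autocorrelation: {weekly_autocorr:.3f}")
```

A slope clearly different from zero suggests a trend, and a high autocorrelation at the chosen lag suggests a repeating pattern with that period. Real data is noisier, so treat both numbers as hints to follow up on with a plot.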
Next, calculate the mean and variance of each feature. Does the feature barely change? Is it in constant flux? Form a hypothesis about the behavior you observe. A feature with very low or very high variance may need further investigation.
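A small sketch of this step with pandas, using invented columns. Note that pandas computes the sample variance (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical features; "constant_flag" never changes.
df = pd.DataFrame({
    "delivered_orders": [10, 12, 11, 15, 12],
    "constant_flag": [1, 1, 1, 1, 1],
})

# One row per feature, with its mean and sample variance.
summary = df.agg(["mean", "var"]).T

# Flag features whose variance is (numerically) zero.
low_variance = summary.index[summary["var"] < 1e-12].tolist()

print(summary)
print("near-constant features:", low_variance)
```

A feature that never changes carries no information for a model, so it is worth asking whether it was recorded correctly.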
Probability Mass Functions (PMFs) and Probability Density Functions (PDFs) are your friends here. Use a PMF to understand the shape of a discrete feature and a PDF for a continuous one.
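As an illustration, an empirical PMF can be estimated from relative frequencies, and a PDF can be approximated with a normalized histogram. The data below is invented, and the histogram is only one of several ways to estimate a density (a kernel density estimate is another).

```python
import numpy as np
import pandas as pd

# Discrete feature (hypothetical): empirical PMF from relative frequencies.
orders = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 2, 3])
pmf = orders.value_counts(normalize=True).sort_index()

# Continuous feature (simulated): histogram normalized to integrate to 1.
rng = np.random.default_rng(0)
delivery_minutes = rng.normal(loc=30, scale=5, size=1_000)
density, bin_edges = np.histogram(delivery_minutes, bins=20, density=True)

print(pmf)
print(density.round(4))
```

The PMF values sum to 1, and the histogram bar areas (height times bin width) sum to 1, which is what lets you read both as probabilities.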
Correlations
Correlation measures how strongly two variables are related. Let’s look at the relationship between two discrete features: Delivered Orders and Fulfilled Orders. The simplest way to visualize the correlation is a scatter plot with Delivered Orders on the y-axis and Fulfilled Orders on the x-axis. As expected, these two features are positively correlated.
If your dataset has a large number of features, producing this plot for every pair quickly becomes impractical. In that case, I suggest computing the Pearson correlation matrix. It measures the linear correlation between each pair of features in your dataset and assigns each pair a value between -1 and 1. A positive value indicates a positive relationship, whereas a negative value indicates a negative one.
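With pandas, the matrix is a single call. The sketch below uses made-up numbers; the "cancelled_orders" column is a hypothetical third feature added only to show a negative correlation.

```python
import pandas as pd

# Hypothetical order data; all values are invented for illustration.
df = pd.DataFrame({
    "fulfilled_orders": [10, 12, 15, 18, 20],
    "delivered_orders": [9, 11, 14, 17, 19],
    "cancelled_orders": [5, 4, 4, 2, 1],
})

# Pairwise Pearson correlation between all numeric features.
corr = df.corr(method="pearson")

print(corr.round(2))
```

Keep in mind that Pearson correlation only captures linear relationships, so a value near 0 does not rule out a nonlinear dependence.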
It’s crucial to keep track of any strong relationships between features. You may see many correlated features in your dataset, or very few. Each dataset is unique! Form hypotheses about why certain features are related to one another.
Wrap Up
- Your data may have missing values. Make sure you know why they’re there and what you’re going to do about them.
- Briefly describe your features and group them into categories. This will have a significant impact on the visuals and statistical methods you use.
- Visualize the distribution of your data to understand it better. You never know what you’ll find! Get to know how your data evolves over time and across samples.
- Your features are related to one another. Take note of these relationships; they may prove useful later.