In this post we will focus on data preparation steps for machine learning tasks. While this might seem like a laborious task, it is an essential step on the way to building quality models. If done correctly, many of the steps can be automated and executed efficiently, thus enabling your team to focus more on the core research and less on the engineering overhead. While there are many types of data, and different origins for datasets, we will discuss the stages of data preparation that are common to most datasets, whether it be a handcrafted tabular dataset, ImageNet or unstructured text from large sections of the internet.
Splitting the Data
Split your data into training, validation and test sets. The test set, or hold-out set, should not be touched until final evaluation, other than to run it through the same processing and cleaning stages as the rest of the data. In other words, develop insights from the training data only, and then apply the resulting processing to all three sets. This will give you an accurate estimate of your final model's performance on real-world data. Use the validation set for hyperparameter tuning and model selection.
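As a rough illustration of such a split, here is a minimal sketch using only the Python standard library. The function name `train_val_test_split`, the 70/15/15 proportions and the fixed seed are assumptions made for this example, not recommendations from the post.

```python
import random


def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle a copy of `rows` and cut it into train, validation and test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


rows = list(range(100))  # stand-in for 100 dataset rows
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))  # 70 15 15
```

Fixing the seed matters more than it may seem: if the split changes between runs, rows that were once in the test set can quietly end up influencing training decisions.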
The first step in data preparation for machine learning is getting to know your data. Exploratory data analysis (EDA) will help you determine which features will be important for your prediction task, as well as which features are unreliable or redundant. There is no substitute for getting your hands dirty with the data before you train anything on it.
EDA on restaurant tip data may lead you to add a feature indicating whether the amount is rounded to a whole or half dollar.
Data cleaning in machine learning is essential for generating high-quality models. Clean data will enable the training process to focus on real meaningful patterns in the data, and not waste energy on missing and incorrect data.
Perhaps the most common type of dirty data is missing values. These can show up as NaN (not a number) values, blank strings or a default category. If many entries are missing a specific attribute, we may consider removing the attribute completely. On the other hand, if a specific entry is missing many attributes, we may want to discard the entry. Both can be done easily using the pandas library. We will demonstrate some of the steps on data from the well-known Titanic ML competition.
```python
import pandas as pd

# Load the Titanic training set and summarize columns, dtypes and non-null counts
df = pd.read_csv('../input/titanic/train.csv')
df.info()
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
```
The Cabin field is mostly null (only 204 of its 891 entries are populated), so we decide to remove it from the data:
```python
# Drop the mostly empty Cabin column
df = df.drop(columns='Cabin')
```
Similarly, for categorical data we want to ensure that the data falls in the expected categories:
```python
df['Embarked'].unique()
```

Output:

```
array(['S', 'C', 'Q', nan], dtype=object)
```
We can then remove the rows for which ‘Embarked’ has a nan value:
```python
# Keep only the rows where Embarked is present
df = df.dropna(subset=['Embarked'])
df['Embarked'].unique()
```

Output:

```
array(['S', 'C', 'Q'], dtype=object)
```
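The two rules of thumb from the start of this section (remove an attribute that is missing for many entries, discard an entry that is missing many attributes) can also be written as explicit thresholds. The sketch below uses plain Python on a list of dictionaries; the helper names, the 50% thresholds and the toy records are all assumptions made for illustration and are not real Titanic rows.

```python
import math


def is_missing(value):
    """Treat None, NaN and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def drop_sparse(rows, columns, max_col_missing=0.5, max_row_missing=0.5):
    """Drop columns, then rows, whose share of missing values exceeds the thresholds."""
    kept_cols = [
        c for c in columns
        if sum(is_missing(r.get(c)) for r in rows) / len(rows) <= max_col_missing
    ]
    if not kept_cols:
        return [], []
    kept_rows = [
        {c: r.get(c) for c in kept_cols}
        for r in rows
        if sum(is_missing(r.get(c)) for c in kept_cols) / len(kept_cols) <= max_row_missing
    ]
    return kept_cols, kept_rows


# Toy records, made up for this example
rows = [
    {"age": 22.0, "cabin": None, "port": "S"},
    {"age": 38.0, "cabin": "C85", "port": "C"},
    {"age": float("nan"), "cabin": None, "port": ""},
    {"age": 35.0, "cabin": None, "port": "S"},
]
cols, cleaned = drop_sparse(rows, ["age", "cabin", "port"])
print(cols)          # ['age', 'port']
print(len(cleaned))  # 3
```

Columns are filtered before rows so that a mostly empty column does not cause otherwise complete rows to be thrown away.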
More advanced cleaning may also include outlier detection and removal. Human annotators are often used to validate and correct datasets too. While this can be impractical for large datasets, you may want to focus your cleaning effort on the test set at the very least.
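As one simple example of outlier detection, the sketch below flags values that fall more than 1.5 interquartile ranges outside the quartiles. This rule is chosen here only for illustration; the post does not prescribe a specific method, and the fare values are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ticket prices with one suspicious entry
fares = [7.25, 8.05, 7.9, 13.0, 8.5, 10.5, 9.0, 512.0]
print(iqr_outliers(fares))  # [512.0]
```

A flagged value is not necessarily an error, so it is usually worth inspecting outliers before deleting them.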
Usually it is best to get rid of redundant features. Rather than contributing additional information, redundant features mostly add noise and extra parameters for the model to fit. To detect such features, start by plotting the feature correlation matrix.
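```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation is only defined for numeric columns; numeric_only=True skips
# the remaining text columns (Name, Sex, Ticket, Embarked), which recent
# pandas versions would otherwise raise an error on.
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
```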
In our example the most strongly positively correlated features are Parch (number of parents/children aboard the Titanic) and SibSp (number of siblings/spouses aboard).
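Reading a heatmap by eye does not scale to many features, so the same check can be automated by listing every pair of features whose absolute correlation exceeds a threshold. The sketch below is a plain-Python illustration; the 0.9 threshold, the function names and the toy columns are assumptions for this example, and it assumes no column is constant.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Toy data: the same height recorded in two units, plus an unrelated feature
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [31.0, 22.0, 45.0, 28.0, 39.0],
}
pairs = correlated_pairs(columns)
for a, b, r in pairs:
    print(a, b, round(r, 3))
```

Here only the two height columns are flagged, and one of them could be dropped without losing information.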
In addition to this manual method for detecting redundant features, we can use a method such as PCA for dimensionality reduction. Strictly speaking, PCA is feature extraction rather than feature selection: it generates new features (principal components) that are linearly uncorrelated with each other.
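To make the idea behind PCA concrete, the following from-scratch sketch finds the first principal component of a tiny dataset using power iteration on the covariance matrix. Power iteration is chosen here only because it fits in a few lines; it is not how any particular library implements PCA, and the function names and data are assumptions for this example. In practice you would use a library implementation.

```python
import math


def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Approximate the dominant eigenvector of `cov` by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data in which the second feature is roughly twice the first
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
cov, centered = covariance_matrix(rows)
pc1 = first_principal_component(cov)

# Project each centered row onto the first component: two features become one
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
print([round(x, 3) for x in pc1])
```

Because the two features are almost perfectly correlated, a single component pointing roughly along the direction (1, 2) captures nearly all of the variance.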
Normalization and Standardization
Normalization and standardization are practices used to make the data more uniform. This way algorithms can treat different features “equally” and not be affected dramatically by whether a measurement uses inches or centimeters. Similarly, two identical pictures taken with different cameras should end up with similar features.
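In normalization we bring all of the values into a predefined range ([0,1] for example). For the range [0,1] this is done using the following formula, where $x_{\min}$ and $x_{\max}$ are the smallest and largest values of the feature:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$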
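In standardization we rescale each feature to have zero mean and unit variance, where $\mu$ and $\sigma$ are the mean and standard deviation of the feature. Note that this shifts and scales the values but does not change the shape of their distribution, so a skewed feature stays skewed:

$$z = \frac{x - \mu}{\sigma}$$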
It is important to note that the parameters used for normalization and standardization should be determined only by the training set to avoid data leakage.
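The sketch below illustrates this point: the scaling parameters are computed from the training values only and then reused unchanged on the test values. The function names and numbers are assumptions made for this example.

```python
import statistics


def fit_min_max(train_values):
    """Learn the min-max normalization parameters from training data."""
    return min(train_values), max(train_values)


def min_max_scale(values, lo, hi):
    """Map values so that lo becomes 0 and hi becomes 1."""
    return [(v - lo) / (hi - lo) for v in values]


def fit_standardizer(train_values):
    """Learn the mean and (population) standard deviation from training data."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def standardize(values, mean, std):
    """Shift and scale values using previously learned parameters."""
    return [(v - mean) / std for v in values]


train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

# Parameters come from the training set only
lo, hi = fit_min_max(train)
mean, std = fit_standardizer(train)

train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)
test_scaled = min_max_scale(test, lo, hi)
print(test_scaled)
```

Note that the second test value is larger than anything seen in training, so its normalized value lands above 1. That is expected and is preferable to recomputing the range with test data included.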
Conversion to Numerical Data
Machine learning models consist of many small mathematical operations, and thus they generally operate on numbers. How do we convert categorical data and textual data to numbers?
As a first attempt, each possible category can be assigned a different number. For example, we can define shoe = 0, shirt = 1 and pants = 2 for the Fashion-MNIST dataset. However, there is a problem with this method: it implies that “shoe” and “shirt” are more similar to each other than “shoe” and “pants”, even though the ordering is completely arbitrary. Furthermore, if a model needs to act very differently depending on the value of this categorical feature, that is hard to achieve with simple mathematical operations when the classes are just points on a line.
Now to our second attempt. Let’s convert each possible categorical value into a binary feature. A single categorical feature is thus converted into a one-hot vector (only one value is turned on), and our model will be able to learn patterns that relate to each class separately. While this is a good option for categories with a small number of possible values, what should we do when the set of possible values is huge? How can we represent a word as a vector in an efficient manner?
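```python
# One-hot encode the low-cardinality categorical columns. Listing them
# explicitly avoids creating one column per unique Name or Ticket value.
# drop_first=True drops one redundant indicator column per feature.
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
```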
Code for converting categorical features to one-hot features
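To show what one-hot encoding does under the hood, here is a minimal plain-Python sketch using the clothing categories from the earlier example. The function name is an assumption for this example, and unlike the pandas call it keeps every category (no column is dropped).

```python
def one_hot_encode(values):
    """Return the sorted category list and a one-hot vector for each value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1  # turn on only the position of this category
        vectors.append(vec)
    return categories, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


labels = ["shoe", "shirt", "pants", "shirt"]
categories, vectors = one_hot_encode(labels)
print(categories)  # ['pants', 'shirt', 'shoe']
print(vectors)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

# Every pair of distinct categories is equally far apart, so no artificial
# ordering is implied, unlike the integer encoding.
print(squared_distance(vectors[0], vectors[1]))  # 2
print(squared_distance(vectors[0], vectors[2]))  # 2
```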
A widely used option for representing words is word embeddings. These embeddings are relatively dense, continuous vectors (typically a few hundred real-valued numbers per word) that are capable of capturing semantic attributes of words. Two common methods for learning such representations are word2vec and GloVe.
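One way to see what “capturing semantic attributes” means is to compare embedding vectors with cosine similarity: related words should point in similar directions. The four-dimensional vectors below are invented purely for illustration and do not come from word2vec, GloVe or any trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real ones have hundreds of dimensions
embeddings = {
    "king": [0.8, 0.6, 0.1, 0.0],
    "queen": [0.7, 0.7, 0.1, 0.1],
    "apple": [0.0, 0.1, 0.9, 0.4],
}

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(round(sim_royal, 3), round(sim_fruit, 3))
```

With these made-up vectors, “king” is far closer to “queen” than to “apple”, which is the kind of structure trained embeddings are meant to provide.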
Data preparation is an important step in developing machine learning models. According to Figure Eight’s 2019 State of AI report, nearly three quarters of technical respondents spend over 25% of their time managing, cleaning and/or labeling data. We recommend recognizing the large role of data preparation in the process of developing ML models and directing resources to making this process efficient and accurate. Proper data preparation will save you time in later stages like debugging and validating your machine learning model, and it will ensure that training is focused on the actual task you wish to learn.