The output of a machine learning model is determined not only by the model and its hyperparameters but also by how the different types of variables are processed and fed into it. Preprocessing categorical variables matters because most machine learning models accept only numerical inputs. We must translate categorical variables into numbers so that the model can run and extract useful information from them.
Cleaning and processing data is often estimated to take up 70 to 80 percent of a data scientist’s time, and converting categorical data is an unavoidable part of that work. Done well, it can improve model accuracy and it also aids feature engineering. The question now is how to proceed: which type of categorical data encoding should we use?
What is categorical data?
Since we’ll be dealing with categorical variables in this post, here’s a refresher on the topic with some examples. A categorical variable takes one of a finite number of values, which are commonly stored as ‘strings’ or ‘categories.’ Listed below are a few examples of categorical variables:
- The location of a person’s residence: Chicago, Mumbai, New York, London, and so on.
- Business departments: HR (Human Resources), Finance, IT (Information Technology), and so on.
- A person’s highest degree: high school, diploma, bachelor’s, master’s, and doctorate.
Each variable in the preceding examples has a fixed, limited set of possible values. We can also see that there are two types of categorical data:
- Nominal Data: There is no intrinsic order to the categories.
- Ordinal Data: The categories are arranged in a specific order.
When encoding ordinal data, it is important to preserve the order of the categories. In the example above, a person’s highest degree carries important information about their qualifications, and it can be a significant factor in determining whether an individual is qualified for a position.
When encoding nominal data, we only need to record the presence or absence of each category; there is no notion of order. Take a person’s home city, for instance. It is useful to record where an individual lives, but the cities have no order or sequence: as far as ranking goes, it makes no difference whether an individual lives in Delhi or Bangalore.
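The difference between the two cases can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the helper names `encode_ordinal` and `encode_one_hot` are hypothetical, the degree order is taken from the example list above, and one-hot encoding is used here as one common way of recording the presence or absence of a nominal category. In practice, libraries such as pandas or scikit-learn provide equivalent encoders.

```python
# Ordinal feature: the category order is supplied explicitly
# (assumed here from the "highest degree" example above).
DEGREE_ORDER = ["high school", "diploma", "bachelor's", "master's", "doctorate"]
degree_to_rank = {degree: rank for rank, degree in enumerate(DEGREE_ORDER)}


def encode_ordinal(values, mapping):
    """Replace each category with its integer rank, preserving the order."""
    return [mapping[v] for v in values]


def encode_one_hot(values):
    """Return (categories, rows); each row has a single 1 marking the category present."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


degrees = ["bachelor's", "high school", "doctorate"]
cities = ["Delhi", "Bangalore", "Delhi"]

degree_codes = encode_ordinal(degrees, degree_to_rank)
city_columns, city_rows = encode_one_hot(cities)

print(degree_codes)   # [2, 0, 4]
print(city_columns)   # ['Bangalore', 'Delhi']
print(city_rows)      # [[0, 1], [1, 0], [0, 1]]
```

The degree codes can be compared meaningfully (a doctorate ranks above a bachelor’s), whereas the city columns only say which city is present and imply no ranking between Delhi and Bangalore.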
Ordinal and Nominal Variables
As the name implies, numerical data consists solely of numerical elements like floating-point values or integers.
Categorical data consists of variables with label values rather than numeric values.
The number of possible values is often small and fixed.
Here are a few examples:
- “Snake” and “Turtle” are the values of the “pet” attribute.
- “purple,” “yellow,” and “black” are the values of the “color” attribute.
- “third,” “fourth,” and “fifth” are the values of the “place” attribute.
Each value corresponds to a distinct group.
Some categories may have a natural relationship to one another, such as a natural ordering.
The values of the “place” attribute above have such a natural ordering. Because its values can be ordered or ranked, this type of variable is called an ordinal variable.
A numerical variable can be converted into an ordinal one by splitting its range into bins and assigning a value to each bin.
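As a rough sketch of this binning idea, the snippet below uses Python’s standard `bisect` module. The "age" variable, the bin edges, and the bin labels are all hypothetical values chosen for illustration; they do not come from any particular dataset.

```python
from bisect import bisect_right

# Hypothetical bin boundaries and labels for an "age" variable (illustrative only).
BIN_EDGES = [18, 40, 65]
BIN_LABELS = ["child", "young adult", "middle-aged", "senior"]


def to_ordinal_bin(value):
    """Return the index of the bin containing value; a larger index means a higher bin."""
    return bisect_right(BIN_EDGES, value)


ages = [5, 18, 39, 40, 70]
age_bins = [to_ordinal_bin(a) for a in ages]
age_labels = [BIN_LABELS[b] for b in age_bins]

print(age_bins)    # [0, 1, 1, 2, 3]
print(age_labels)  # ['child', 'young adult', 'young adult', 'middle-aged', 'senior']
```

The resulting bin indices form an ordinal variable: the exact ages are discarded, but their relative order is kept. With this choice of `bisect_right`, a value equal to an edge falls into the higher bin.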
To summarize the two types:
- Nominal: the variable takes one of a finite set of values that have no order relative to one another.
- Ordinal: the variable takes one of a finite set of values that have a ranked order.
Some algorithms can work directly with categorical data.
A decision tree, for instance, can be learned directly from categorical data without any data transformation.
Many ML algorithms, however, cannot work directly on label data. They require all input and output variables to be numeric.
Generally speaking, this is a constraint of efficient implementations of ML algorithms rather than a hard limitation of the algorithms themselves.
Most scikit-learn estimators, for example, expect numeric input.
This implies that categorical data must be transformed into numerical data. If the categorical variable is an output variable, you may want to convert the model’s predictions back to categorical form to use or present them in a program.
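The round trip for a categorical output variable might look like the following plain-Python sketch. The helper `fit_label_mapping` and the hard-coded `predicted_codes` (a stand-in for a real model’s predictions) are assumptions made for illustration; scikit-learn’s `LabelEncoder`, with its `inverse_transform` method, offers the same idea as a library utility.

```python
def fit_label_mapping(labels):
    """Build a list of class names and a label -> integer code lookup."""
    classes = sorted(set(labels))
    to_code = {label: code for code, label in enumerate(classes)}
    return classes, to_code


# Categorical target values, using the "pet" attribute from the examples above.
pets = ["Turtle", "Snake", "Turtle", "Snake"]
classes, to_code = fit_label_mapping(pets)

# Encode the targets as integers before training.
encoded_targets = [to_code[p] for p in pets]

# Stand-in for the integer predictions a trained model would return.
predicted_codes = [0, 1, 1]

# Convert the predictions back to their categorical form for presentation.
predicted_labels = [classes[c] for c in predicted_codes]

print(encoded_targets)   # [1, 0, 1, 0]
print(predicted_labels)  # ['Snake', 'Turtle', 'Turtle']
```

Keeping the `classes` list alongside the model is what makes the reverse step possible, since the integer codes carry no meaning on their own.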
To sum up, categorical data must be encoded as part of the feature engineering process. More important still is understanding which encoding scheme to use, taking into account the dataset we’ll be dealing with and the model we’ll be employing.