Supervised vs. Unsupervised Machine Learning: Types, Use Cases, and Engineering Challenges

Introduction

Supervised and Unsupervised Learning algorithms are two fundamental categories of Machine Learning. While we learn about them early in our data science journey, we might not fully understand their differences, their uses, and how to approach them as engineering problems.

In this article, we will learn the difference between Supervised and Unsupervised Machine Learning algorithms, their main types, and where to use them.

We will also learn the main engineering challenges unique to each type. Supervised algorithms require special attention to their training data and target labels, while Unsupervised algorithms are hard to interpret and validate.

Are you ready? Let’s jump right in!

We will demonstrate their differences using the scikit-learn library and the iris dataset. We can use the iris dataset for Supervised classification and Unsupervised clustering, and conveniently scikit-learn has modules for both.

The code below demonstrates how we pulled the data and prepared it for modeling:

# For IPython/Jupyter environments
%matplotlib inline

from sklearn import datasets
import pandas as pd
import numpy as np
import seaborn as sns

random_seed = 23
# %%
iris_source = datasets.load_iris()

features = iris_source["data"]
feature_names = iris_source.feature_names

labels = iris_source["target"]
label_names = iris_source.target_names

# %%
# Combine the features with human-readable label names in one DataFrame
iris = pd.concat(
    [
        pd.DataFrame(features, columns=feature_names),
        pd.Series(labels)
        .rename("labels")
        .map({i: label for i, label in enumerate(label_names)}),
    ],
    axis=1,
)

iris.sample(5, random_state=random_seed)

Supervised Machine Learning

Supervised Learning is a type of Machine Learning where you use input data or feature vectors to predict the corresponding output vectors or target labels. Alternatively, you may use the input data to infer its relationship with the outputs.

In a Supervised problem, you use a labeled dataset containing prior information about inputs and outputs. You teach the algorithm the “right” outputs from this labeled dataset, hence the name “supervised.”

For each observation of the predictor measurement(s) x_i, i = 1, ..., n, there is an associated response measurement y_i.

(source)

The Supervised algorithm applies a statistical learning method to the labeled dataset. It constructs a mapping function f(X) that best approximates the outputs (Y). In other words, it tries to create a mechanism that predicts the outputs of new, previously unseen input data with as little error as possible.

The outcome of training is the model that you use on new inputs. We assess its performance by testing it on a separate test sample using appropriate metrics (e.g., RMSE for regression, a confusion matrix for classification).

In the following example, we use a decision tree classifier to predict the labels of a test sample:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, train_size=0.8, random_state=random_seed
)

decision_tree_model = DecisionTreeClassifier(random_state=random_seed)
decision_tree_model.fit(X_train, y_train)
decision_tree_predictions = decision_tree_model.predict(X_test)

decision_tree_results = pd.DataFrame(
    np.concatenate(
        [X_test, y_test.reshape(-1, 1), decision_tree_predictions.reshape(-1, 1)],
        axis=1,
    ),
    columns=feature_names + ["true", "predicted"],
).melt(
    id_vars=feature_names,
    value_vars=["true", "predicted"],
    var_name="status",
    value_name="labels",
)

decision_tree_results[
    ~decision_tree_results.duplicated(subset=feature_names + ["labels"], keep=False)
].sort_values(feature_names)

Most of the predictions are correct, save one sample for which the model predicted a different iris type.
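To quantify this, we can score the held-out predictions. Here is a minimal sketch (our addition, not part of the original walkthrough) using scikit-learn’s accuracy score and confusion matrix:

from sklearn.metrics import accuracy_score, confusion_matrix

# Fraction of test samples classified correctly
print(accuracy_score(y_test, decision_tree_predictions))

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, decision_tree_predictions))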

Types of Supervised Algorithms

We commonly group Supervised algorithms based on their predicted output type:

  • Classification, where the output is a discrete category (e.g., an iris species or spam/not spam)
  • Regression, where the output is a continuous value (e.g., a stock price or temperature)

Another grouping of Supervised algorithms considers the learning algorithm used. Here is a list of the most common examples; a short sketch after the list shows how a few of them can be tried on our data:

  • Linear regression
  • Logistic regression
  • Decision trees
  • Support-vector machines
  • K-nearest neighbors
  • Decision Forests (e.g., Random Forest, XGBoost)
  • Deep Learning algorithms for Supervised Learning
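As a quick, hedged illustration of how interchangeable these estimators are in scikit-learn, we can fit two more of them on the split from the earlier example (the model choices and hyperparameters here are ours, for demonstration only):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Every scikit-learn estimator shares the same fit/score interface,
# so the split from the decision tree example can be reused directly.
for model in [
    LogisticRegression(max_iter=200),
    KNeighborsClassifier(n_neighbors=5),
]:
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))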

What Supervised Learning Algorithms Are Good For

We use Supervised Learning algorithms when we have prior known values or ground truth about a particular output variable and want to predict it with the help of related input features.

The fundamental characteristics of these use cases are the value of predicting the target label and an abundance of data describing its relationship with the input features.

Here are common real-world examples:

  • Stock price prediction
  • Image identification
  • Customer churn prediction
  • Spam detection
  • Weather forecasting

Engineering Challenges with Supervised Algorithms

Software engineering issues specific to Supervised algorithms relate to the use of training data and their performance assessment:

  • Sampling. We might not be able to cleanly separate training and testing sets; the target labels might be imbalanced; our training dataset might not represent the data seen in real-world applications well (a mitigation sketch follows this list).
  • Data leakage. Information from the test set might leak into the training set due to flawed data processing.
  • Irrelevant training features. We train the model on input features that have little predictive power.
  • Misuse of metrics. We might use the wrong performance metrics or misinterpret their meaning.
  • Concept drift. The relationship between the input features and the target labels changes over time, causing our model’s performance to fluctuate.
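As one small mitigation sketch for the sampling issue (an addition of ours, reusing the earlier split logic), scikit-learn’s stratify argument keeps the label proportions consistent across the training and test sets:

# Stratified split: both sets preserve the overall label distribution,
# guarding against imbalanced or unrepresentative samples.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    features, labels, train_size=0.8, stratify=labels, random_state=random_seed
)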

Unsupervised Machine Learning

Unsupervised Learning is a branch of Machine Learning where we apply statistical learning methods to understand our data or create a better representation of it. In this case, we do not have explicit labels.

[U]nsupervised learning describes the somewhat more challenging situation in which for every observation i = 1, ..., n, we observe a vector of measurements x_i but no associated response y_i.

(source)

With Unsupervised Learning, we do not have a narrow aim like predicting a target label as in the Supervised case. Instead, we use it for a wider range of purposes:

  • Understand the underlying structure of the data.
  • Identify and generate unrecognized groups and features.
  • Have a better representation of the data for further modeling.

The underlying algorithms used for Unsupervised problems vary as the different use cases require different approaches and learning methods.

Types of Unsupervised Algorithms

Based on their intended use, Unsupervised algorithms fall into the following categories; a brief sketch of one of them, dimensionality reduction, follows the list:

  • Clustering
  • Anomaly detection
  • Dimensionality reduction
  • Association rule learning
  • Autoencoders
  • Pre-training within Deep Learning algorithms
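To make one of these categories concrete, here is a brief sketch (our addition) of dimensionality reduction with PCA on the iris features:

from sklearn.decomposition import PCA

# Project the four iris measurements onto two principal components
pca = PCA(n_components=2, random_state=random_seed)
features_2d = pca.fit_transform(features)

# Fraction of the total variance each component retains
print(pca.explained_variance_ratio_)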

What Unsupervised Learning Algorithms Are Good For

Most customer-facing use cases of Unsupervised Learning involve data exploration, grouping, and a better understanding of the data. In Machine Learning engineering, they can enhance the input of Supervised Learning algorithms and be part of a multi-layered neural network.

Specific examples:

  • Customer segmentation with clustering
  • Fraud and anomaly detection
  • Market basket analysis with association rules
  • Dimensionality reduction before Supervised modeling

Engineering Challenges of Unsupervised Algorithms

The main engineering challenges specific to Unsupervised Learning algorithms come from the lack of target labels and the open-ended nature of the problems. Below you will find a summary of challenges researchers have identified in networking applications of Unsupervised Learning:

  • Unverifiable. We don’t know the true structure of the data or even the number of clusters, so we have to assess the model’s performance by subjective means.
  • Interpretability. The results might be hard to interpret or even meaningless.
  • Need for supervision. We cannot apply the outcomes automatically as they often require human assessment and intervention.
  • Misalignment with goals. The generated representation might not align with the intended application.

Let’s demonstrate one of the simpler issues. In the following example, we cluster the iris dataset with K-means. To keep things realistic, we pretend not to know the true number of clusters and generate five of them.

from sklearn.cluster import KMeans

kmeans_model = KMeans(n_clusters=5, random_state=random_seed)
kmeans_predictions = kmeans_model.fit_predict(features)
kmeans_predictions

The model produces five distinct clusters that might even resemble the “true” iris types. However, it cannot tell us on its own whether the results represent the underlying categories well. To judge that, we need additional information, such as manual labels or a domain expert’s opinion.
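When ground-truth labels happen to exist, as they do for iris, one way to sanity-check the clustering (a sketch of ours, not from the original article) is to compare the clusters against them, for example with the adjusted Rand index or a simple cross-tabulation:

from sklearn.metrics import adjusted_rand_score

# 1.0 means the clusters match the true species perfectly;
# values near 0 mean agreement no better than chance.
print(adjusted_rand_score(labels, kmeans_predictions))

# Cross-tabulate cluster IDs against the true species names
print(pd.crosstab(kmeans_predictions, iris["labels"]))

In production, of course, such labels are usually unavailable, which is exactly the challenge.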

Supervised Learning vs. Unsupervised Learning

Let’s summarize what we have learned in this article.

In Supervised Learning, you have information about the relationship between predictor features and target labels, and you try to predict or infer labels in unseen data.

In contrast, Unsupervised Learning does not have labels and you try to identify the structure of your data or generate a more effective representation.

In probabilistic terms, Supervised Learning requires you to infer the conditional probability distribution of the output given the input data, P(Y | X). In Unsupervised Learning, you try to infer the prior probability distribution of your data, P(X).

These differences are not always as clear-cut in real life, as the wide use of Semi-supervised Learning attests. You can approach a data science problem with different combinations of Supervised and Unsupervised Learning algorithms. However, you may want to keep in mind which algorithm to use in which situation and what engineering challenge to look out for.

In this article, we identified a few of these challenges. For Supervised models, the issues center on training data integrity and usability. Problems with Unsupervised models have more to do with correctly interpreting the results and preventing wrong outputs from being consumed automatically.

To systematically address these challenges, you should build and maintain a continuous validation framework as it will provide the necessary checks for your models. You will trust your models and their output more when you have such a framework in place.

With Deepchecks, you can continuously validate your Machine Learning pipeline for data integrity, model confidence, and statistical learning issues. You can make use of it specifically in the following challenges:

  • Observability
  • Alerting
  • Querying
  • Mismatching
  • Analysis

Deepchecks validates for all three types of drift (label, prediction, and data) by comparing the independent and joint distributions of features and target labels. It also detects leakage by verifying that datasets are split correctly and that no label is used as part of the features. And it produces integrity alerts when the data schema changes or does not match between production and training.
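For illustration only, here is a minimal sketch of how such a validation suite might be wired up with the deepchecks package; the exact API differs between versions, so treat the names below as assumptions rather than exact usage:

from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import full_suite

# Wrap train and test data (with an assumed label column "labels")
# in Deepchecks Dataset objects.
train_ds = Dataset(
    pd.DataFrame(X_train, columns=feature_names).assign(labels=y_train),
    label="labels",
)
test_ds = Dataset(
    pd.DataFrame(X_test, columns=feature_names).assign(labels=y_test),
    label="labels",
)

# Run the full validation suite against the trained decision tree
suite_result = full_suite().run(train_ds, test_ds, decision_tree_model)
suite_result.show()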

Would you like to learn more about how Deepchecks validates Machine Learning pipelines? Check out our case studies page, where you will learn about case-specific solutions.
