How to Create Unbiased ML Models


AI systems are becoming increasingly popular and central in many industries. They decide who might get a loan from the bank, whether an individual should be convicted, and we may even entrust them with our lives in systems such as autonomous vehicles in the near future. Thus, there is a growing need for mechanisms to harness and control these systems so that we may ensure that they behave as desired.

One important issue that has been researched widely in the last few years is fairness. While ML models are usually evaluated on metrics such as accuracy, fairness requires that our models are also unbiased with regard to sensitive attributes such as gender and race.

A classic example of racial bias in AI systems is the COMPAS software system, developed by Northpointe, which aims to assist US courts in assessing the likelihood that a defendant will reoffend. ProPublica published an article claiming that this system is biased against Black defendants, assigning them higher risk ratings.

ML system bias against African Americans? (source)

In this post, we will try to understand what exactly a fair ML model is, how to detect biases in our models, and how to create models that are unbiased.

Where Does Bias Come From?

“Humans are the weakest link”

Remember, an ML model can only be as good as the data it’s trained on, and thus if the training data contains biases, we can expect our model to mimic those same biases.

Some nice examples of this can be found in the field of word embeddings in NLP. Word embeddings are learned dense vector representations of words that are meant to capture a word's semantic information, and that can then be fed to ML models for different downstream tasks. Thus, for example, embeddings of words with similar meanings are expected to be “close” to each other.

Word embeddings can capture the semantic meaning of words (source)
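
As a minimal sketch of what “close” means here, the snippet below compares a few vectors using cosine similarity. The three-dimensional vectors and the `cosine_similarity` helper are hypothetical and chosen by hand for illustration; real embeddings are learned and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy, hand-picked vectors (hypothetical, not real embeddings)
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat-dog: {sim_cat_dog:.3f}, cat-car: {sim_cat_car:.3f}")
```

In this toy setup the “cat” and “dog” vectors point in nearly the same direction, so their similarity is much higher than that of “cat” and “car”.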


It turns out that the embedded space can be used to extract relations between words and to find analogies as well. A classic example of this is the well-known king-man+woman=queen equation. However, if we substitute the word “doctor” for the word “king”, we get “nurse” as the female equivalent of “doctor”. This undesired result simply reflects existing gender biases in our society and history. If doctors are generally male and nurses are generally female in most available texts, that’s what our model will learn.


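# Word analogy example using spaCy word vectors
# (assumes a model that ships with vectors, e.g. en_core_web_md, is installed)
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

def analogy(a, b, c, n=10):
    # "a is to b as what is to c?": find the word nearest to a - b + c
    query = nlp.vocab[a].vector - nlp.vocab[b].vector + nlp.vocab[c].vector
    keys, _, _ = nlp.vocab.vectors.most_similar(np.asarray([query]), n=n)
    candidates = [nlp.vocab.strings[int(k)] for k in keys[0]]
    # Skip the input words themselves, which are often the nearest neighbors
    return next(w for w in candidates if w.lower() not in {a, b, c})

# king is to man as what is to woman?
print(analogy("king", "man", "woman"))

Output: queen

# doctor is to man as what is to woman?
print(analogy("doctor", "man", "woman"))

Output: nurse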

Code example: man is to doctor as woman is to nurse according to gensim word2vec (source)

Culture-Specific Tendencies

Currently, the most used language on the internet is English. Much of the research and many of the products in Data Science and ML are developed in English as well. Thus, many of the “natural” datasets used to create huge language models tend to reflect American thought and culture, and may be biased against other nationalities and cultures.

Cultural bias: GPT-2 needs active steering in order to produce a positive paragraph with the given prompt (source)

Synthetic Datasets

Some biases in the data may be created unintentionally in the process of the dataset’s construction. During construction and evaluation, people are more likely to notice and pay attention to details they are familiar with. A well-known example of an image classification mistake is when Google Photos misclassified black people as gorillas. While a single misclassification of this sort may not have a strong impact on the overall evaluation metrics, it is a sensitive issue and could have a large impact on the product and the way customers relate to it. 

Racist AI algorithm? Misclassification of black people as gorillas (source)

Biased datasets result in biased models: popular object recognition datasets contain images mostly from the US and GB (source)


In conclusion, no dataset is perfect. Whether a dataset is handcrafted or “natural”, it is likely to reflect the biases of its creators, and thus the resulting model will contain the same biases as well.

Can We De-Bias the Data?

The power of ML comes from the ability to leverage large amounts of data in order to identify patterns and make decisions. Thus it is not realistic to manually validate that large datasets are unbiased. However, when constructing synthetic datasets it is important to be aware of potential biases and attempt to minimize them.

Testing for Fairness

Before discussing how to test for fairness, we need to have some formal definitions for a fair model.

“In machine learning, a given algorithm is said to be fair, or to have fairness, if its results are independent of given variables, especially those considered sensitive, such as the traits of individuals which should not correlate with the outcome (i.e. gender, ethnicity, sexual orientation, disability, etc.).”

  • “Fairness”, Wikipedia


Statistical Tests

We are interested in ensuring our model is blind to these sensitive attributes in some sense. One common mathematical definition for fairness in classification problems uses the notion of statistical independence.

Definition of independence (source)
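
Written out for a binary classifier, and assuming notation that is common in the fairness literature (R is the classifier's prediction, A is the sensitive attribute, and a and b are any two groups; these symbols are assumptions rather than taken from the original figure), independence requires:

```latex
P(R = 1 \mid A = a) = P(R = 1 \mid A = b)
```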

We can also relax the criterion to include some slack, and then we require:
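
One common way to write the relaxed condition, assuming R is the classifier's prediction, A is the sensitive attribute, a and b are any two groups, and ε is a slack parameter (all of these symbols are assumptions), is:

```latex
\frac{P(R = 1 \mid A = a)}{P(R = 1 \mid A = b)} \geq 1 - \epsilon
```

With ε = 0.2, the positive-prediction rate of group a must be at least 80% of that of group b.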

This condition corresponds to the “80% Rule” in disparate impact law.

Another “weaker” notion of fairness is sufficiency, for which we require:
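
In the notation commonly used for this criterion (an assumption here: Y is the true outcome, R is the classifier's prediction or score, A is the sensitive attribute, and a and b are any two groups), sufficiency can be written as:

```latex
P(Y = 1 \mid R = r, A = a) = P(Y = 1 \mid R = r, A = b) \quad \text{for all } r
```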

This essentially means that we require similar per-class precision rates for different groups. Thus we do not require complete independence, but only that the rate of “mistakes” of our algorithm is constant for different groups.

These criteria can be tested empirically using a simple evaluation process on the dataset.
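
As a rough sketch of such an evaluation, the snippet below computes per-group selection rates (for independence and the 80% rule) and per-group precision (for sufficiency on the positive class). The function names and the toy labels are hypothetical, and the code assumes every group receives at least one positive prediction.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Share of positive predictions, P(R=1 | A=g), for each group g."""
    counts, positives = defaultdict(int), defaultdict(int)
    for pred, g in zip(y_pred, groups):
        counts[g] += 1
        positives[g] += pred
    return {g: positives[g] / counts[g] for g in counts}


def disparate_impact_ratio(y_pred, groups):
    """Lowest selection rate divided by the highest (1.0 means equal rates)."""
    rates = selection_rates(y_pred, groups)
    return min(rates.values()) / max(rates.values())


def precision_by_group(y_true, y_pred, groups):
    """Share of positive predictions that are correct, P(Y=1 | R=1, A=g)."""
    predicted_pos, true_pos = defaultdict(int), defaultdict(int)
    for truth, pred, g in zip(y_true, y_pred, groups):
        if pred == 1:
            predicted_pos[g] += 1
            true_pos[g] += truth
    return {g: true_pos[g] / predicted_pos[g] for g in predicted_pos}


# Toy evaluation set (hypothetical): five members in each of two groups
groups = ["a"] * 5 + ["b"] * 5
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

rates = selection_rates(y_pred, groups)
ratio = disparate_impact_ratio(y_pred, groups)
precisions = precision_by_group(y_true, y_pred, groups)

print("selection rates:", rates)
print("passes 80% rule:", ratio >= 0.8)
print("precision by group:", precisions)
```

On this toy data the two groups differ both in how often they receive a positive prediction and in how often that prediction is correct, so the model would fail both checks.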

Monitoring Prediction Explanations

Additionally, using explainable models can assist in detecting bias. For example, if we expect a specific protected characteristic to have no correlation with the prediction, we can assert that it has only a small influence on the model’s final prediction. Furthermore, an explanation for a prediction lets us look into the model’s decision process, enabling us to validate that the model is making its predictions for the right reasons.

Visualizing gender bias for loan application prediction: gender=male has positive weight (source)
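
For a linear model, a check of this kind can be sketched directly, since each feature's contribution to the score is simply its weight times its value. The weights, feature names, and the `max_share` tolerance below are hypothetical and only meant to illustrate the idea; more complex models would need a dedicated explanation method.

```python
def feature_contributions(weights, features):
    """Per-feature contribution to a linear model's score (weight * value)."""
    return {name: weights[name] * value for name, value in features.items()}


def protected_share(contributions, protected):
    """Fraction of the total absolute contribution that comes from protected features."""
    total = sum(abs(c) for c in contributions.values())
    if total == 0:
        return 0.0
    return sum(abs(contributions[name]) for name in protected) / total


# Hypothetical weights of a loan-approval model and one applicant's features
weights = {"income": 0.8, "debt_ratio": -0.6, "gender_male": 0.5}
applicant = {"income": 1.2, "debt_ratio": 0.5, "gender_male": 1.0}

contributions = feature_contributions(weights, applicant)
share = protected_share(contributions, protected=["gender_male"])

max_share = 0.05  # hypothetical tolerance for protected-attribute influence
if share > max_share:
    print(f"Warning: protected attributes drive {share:.0%} of this prediction")
```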

Creating Fair ML Models

There are multiple proposed methods for creating fair ML models, which generally fall into one of the following stages: in the data (before training), at training time, and in post-processing.


A naive approach to creating ML models that are unbiased with respect to sensitive attributes is to simply remove these attributes from the data, so that the model cannot use them for its prediction. However, it is not always straightforward to divide attributes into clear-cut categories. For example, a person’s name may be correlated with their gender or ethnicity; nevertheless, we would not necessarily want to regard this attribute as sensitive.
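
To make the proxy problem concrete, the sketch below drops a sensitive column and then measures how strongly each remaining feature still correlates with it. The column names, the toy records, and the `pearson` helper are hypothetical and used only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy records (hypothetical); "gender" is the sensitive attribute
records = [
    {"gender": 1, "income": 52, "height_cm": 181},
    {"gender": 0, "income": 48, "height_cm": 165},
    {"gender": 1, "income": 45, "height_cm": 178},
    {"gender": 0, "income": 55, "height_cm": 162},
    {"gender": 1, "income": 50, "height_cm": 175},
    {"gender": 0, "income": 50, "height_cm": 168},
]
sensitive = "gender"
sensitive_values = [r[sensitive] for r in records]

# Naive debiasing: remove the sensitive column from the training data
training_rows = [{k: v for k, v in r.items() if k != sensitive} for r in records]

# Any remaining feature that correlates strongly with the removed column
# can still leak it to the model
proxy_strength = {
    name: pearson([row[name] for row in training_rows], sensitive_values)
    for name in training_rows[0]
}
for name, corr in proxy_strength.items():
    print(f"{name}: correlation with {sensitive} = {corr:+.2f}")
```

In this toy data the model never sees the sensitive column, yet one of the remaining features is almost as informative as the column that was removed.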


More sophisticated approaches attempt to use dimensionality reduction in order to eliminate sensitive attributes (see here).


At Training Time

An elegant method for creating unbiased ML models is adversarial debiasing. In this method, we simultaneously train two models. The adversary model is trained to predict the protected attributes given the predictor’s prediction or hidden representation. The predictor is trained to succeed on the original task while making the adversary fail, thus minimizing the bias.

Adversarial debiasing illustration: the predictor’s loss function consists of two terms, the predictor loss and the adversarial loss (source)
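
One common way to write such a two-term objective is sketched below; it is an illustration under assumed notation, not necessarily the exact formulation used in the cited source. Here L_P is the predictor's loss on the original task, L_A is the adversary's loss when trying to recover the protected attribute a, θ_P are the predictor's parameters, and λ is a hypothetical weight that trades task performance against debiasing.

```latex
\min_{\theta_P} \; L_P(\hat{y}, y) - \lambda \, L_A(\hat{a}, a)
```

The adversary is trained at the same time to minimize L_A, so the two models pull in opposite directions.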

This method can achieve great results for debiasing models without having to “throw away” the input data; however, it may suffer from the difficulties that generally arise when training adversarial networks.


Post-Processing

In the post-processing stage, we get the model’s predictions as probabilities, but we can still choose how to act based on these outputs. For example, we can move the decision threshold for different groups in order to meet our fairness requirements.
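
A minimal sketch of group-specific thresholds is shown below. The scores, group labels, and threshold values are hypothetical; in practice the thresholds would be chosen on a validation set to meet a specific fairness criterion.

```python
def decide(scores, groups, thresholds):
    """Turn probability scores into 0/1 decisions using a per-group threshold."""
    return [int(s >= thresholds[g]) for s, g in zip(scores, groups)]


def positive_rate(decisions, groups, group):
    """Share of positive decisions within one group."""
    members = [d for d, g in zip(decisions, groups) if g == group]
    return sum(members) / len(members)


# Toy model outputs (hypothetical) for two groups
scores = [0.9, 0.7, 0.6, 0.3, 0.8, 0.55, 0.45, 0.2]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

# A single threshold for everyone
single = decide(scores, groups, {"a": 0.5, "b": 0.5})
# A lower threshold for group b, chosen to equalize the positive rates
adjusted = decide(scores, groups, {"a": 0.5, "b": 0.4})

for name, decisions in [("single", single), ("adjusted", adjusted)]:
    rate_a = positive_rate(decisions, groups, "a")
    rate_b = positive_rate(decisions, groups, "b")
    print(f"{name}: group a = {rate_a:.2f}, group b = {rate_b:.2f}")
```

The model itself is unchanged here; only the rule that maps scores to decisions differs between the two runs.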

One way to ensure model fairness in the post-processing stage is to look at the intersection of the areas under the ROC curves of all groups. The intersection represents the TPRs and FPRs that can be achieved for all groups simultaneously. Note that in order to reach equal TPRs and FPRs for all groups, one might need to purposefully accept worse results on some of the groups.

The colored region is what’s achievable while fulfilling the separation criterion for fairness (source)
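
The idea can be sketched on a small scale by computing the (FPR, TPR) points each group can reach on a fixed grid of thresholds and intersecting them. This is a simplification under stated assumptions: the full achievable region also includes points obtained by randomizing between thresholds, and the labels, scores, and threshold grid below are hypothetical.

```python
def roc_point(y_true, scores, threshold):
    """(FPR, TPR) obtained by predicting positive when score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Round so that points from different groups can be compared exactly
    return (round(fp / negatives, 3), round(tp / positives, 3))


def achievable_points(y_true, scores, thresholds):
    """Set of (FPR, TPR) points reachable with the given thresholds."""
    return {roc_point(y_true, scores, t) for t in thresholds}


thresholds = [0.3, 0.5, 0.7]

# Toy labels and scores (hypothetical) for two groups
points_a = achievable_points([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2], thresholds)
points_b = achievable_points([1, 1, 0, 0], [0.7, 0.4, 0.35, 0.1], thresholds)

# Operating points that both groups can reach at the same time
common = points_a & points_b
print("group a:", sorted(points_a))
print("group b:", sorted(points_b))
print("common: ", sorted(common))
```

In this toy data, group a alone could operate at an FPR of 0 with a TPR of 1, but group b cannot reach that point, so equalizing both rates means settling for one of the common points instead.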


Another method for debiasing a model in the post-processing stage (w.r.t. the sufficiency definition of fairness) involves calibrating the predictions for each group independently.

Calibration is a method for ensuring that the probability outputs of a classification model indeed reflect the matching ratio in the data. Formally, a classification model is calibrated if for each value of r:
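
In the notation commonly used for this definition (an assumption here: R is the model's probability output and Y is the true binary outcome), the condition reads:

```latex
P(Y = 1 \mid R = r) = r
```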

It turns out that calibrating the model for each group separately is enough to satisfy the sufficiency criterion of fairness.
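
A rough way to check calibration per group is to bin the scores and compare the mean predicted probability in each bin with the observed rate of positives. The sketch below does this with a hypothetical `calibration_table` helper, two bins, and toy data; a real check would use more data and more bins.

```python
from collections import defaultdict


def calibration_table(y_true, scores, groups, n_bins=2):
    """Map (group, bin) to (mean predicted score, observed positive rate)."""
    buckets = defaultdict(list)
    for y, s, g in zip(y_true, scores, groups):
        bin_index = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        buckets[(g, bin_index)].append((s, y))
    table = {}
    for key, pairs in buckets.items():
        mean_score = sum(s for s, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        table[key] = (mean_score, observed)
    return table


def calibration_gap(table, group):
    """Largest |mean score - observed rate| over the bins of one group."""
    return max(abs(m - o) for (g, _), (m, o) in table.items() if g == group)


# Toy data (hypothetical): both groups receive the same scores,
# but the outcomes only match those scores for group a
scores = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75] * 2
groups = ["a"] * 8 + ["b"] * 8
y_true = [1, 0, 0, 0, 1, 1, 1, 0] + [1, 1, 0, 0, 1, 1, 1, 1]

table = calibration_table(y_true, scores, groups)
for group in ("a", "b"):
    print(f"group {group}: max calibration gap = {calibration_gap(table, group):.2f}")
```

A group with a large gap would be a candidate for its own calibration step before the scores are compared across groups.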



In conclusion, we have discussed the concepts of bias and fairness in the ML world, and we have seen that model biases often reflect existing biases in society. There are various ways to enforce and test for fairness in our models, and hopefully, using these methods will lead to more just decision-making in AI-assisted systems around the world.

Further Reading

Gender bias word embedding
Fair ML book
Understanding Data Bias
Survey on bias and fairness
Adversarial debiasing 1
Adversarial debiasing 2
CS 294: Fairness in Machine Learning
AI Fairness 360
Machine bias risk assessments in criminal sentencing
Amazon scraps secret AI recruiting tool that showed bias against women


