
Top 10 ML Model Failures You Should Know About

This blog post was written by Preet Sanghavi as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us via blog@deepchecks.com. We typically pay a symbolic fee for content that’s accepted by our reviewers.

The last decade has seen tremendous growth in Machine Learning (ML) models, but there have been cases where ML systems have not proved to be a boon to society. Rather than casting these failures in a bad light, it is important to learn from them. As the saying goes, “Failures don’t have to be your own to learn from them.”

Let us explore how and why some ML models have failed.

#1 The Fall of Zillow

Zillow, a real estate company based in Seattle, Washington, used in-house algorithms to predict fluctuations in housing prices.

Zillow bought approximately 1,900 homes based on algorithmic predictions that were, in some cases, significantly higher than the prevailing rates. Rapid shifts in the real estate market made those fluctuations difficult to predict accurately. Because its ML model was poorly trained, Zillow suffered a loss of over $300 million, and the financial losses led to mass layoffs.

Research showed that the major reason behind this failure was excessive reliance on, and trust in, ML models. This raises the question of whether ML is reliable enough to value assets in a highly volatile market, and whether it can efficiently keep up with newer trends in real estate pricing.

The technical reason for failure in this case can be attributed to data quality. This isn’t to say that the data the company was using was useless, since the model did, in fact, work in some cases. What we do need to understand is that small deviations in data quality can cause consequential discrepancies when the stakes are high. A simple error in the number of rooms in a property, even if it throws off the model by only 10% on a million-dollar property, will amount to $100,000.
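The arithmetic can be made concrete with a toy sketch. This is not Zillow’s actual model; `estimate_price` and its bedroom weighting are made up purely to show how a single bad field in the data propagates into a large dollar discrepancy.

```python
# Illustrative sketch only (hypothetical model, not Zillow's): how one
# wrong field in a listing can translate into a large dollar error.

def estimate_price(base_price: float, bedrooms: int,
                   per_bedroom: float = 0.10) -> float:
    """Toy valuation: each bedroom adds 10% of the base price."""
    return base_price * (1 + bedrooms * per_bedroom)

true_value = estimate_price(1_000_000, bedrooms=3)  # correct listing
bad_value = estimate_price(1_000_000, bedrooms=4)   # off-by-one room count

print(bad_value - true_value)  # 100000.0 -- a $100,000 swing from one bad field
```

A 10% valuation error sounds small in percentage terms, but at million-dollar scale it is exactly the kind of discrepancy that compounds across 1,900 purchases.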

#2 Amazon’s AI Tool’s Recruiting Bias against Women

Amazon has been successfully building automation tools since 2014. Using automation, Amazon has conquered e-commerce, hiring, warehouse price prediction, and user behavior clustering.

However, there have been instances where automation resulted in unfavorable model behavior. One such example is the gender-bias case of 2015. Amazon scored candidates using an algorithm on a scale of one to five.

Amazon prioritized candidates based on this score and would generally hire the top 5%. By the end of 2015, however, Amazon discovered that its algorithm was not scoring candidates in a gender-neutral manner.


The technical reason for failure is that Amazon used 10 years of male-dominated hiring data to train its model. This produced a bias against women in the model’s performance. In this case, the data that was fed to the model was not checked for gender neutrality. This further highlights the reality that ML models are only as good as the data they are trained on.
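A basic audit step like the one below could surface such a skew before training. This is a minimal sketch with invented example rows, not Amazon’s pipeline: it simply counts the distribution of a sensitive attribute in the historical data.

```python
from collections import Counter

# Hypothetical audit step: before training, check whether a sensitive
# attribute (here, gender) is balanced in the historical hiring data.
# The rows below are invented for illustration.
training_rows = [
    {"gender": "male", "hired": 1},
    {"gender": "male", "hired": 1},
    {"gender": "male", "hired": 0},
    {"gender": "female", "hired": 1},
]

counts = Counter(row["gender"] for row in training_rows)
total = sum(counts.values())
for gender, n in counts.items():
    print(f"{gender}: {n / total:.0%}")
# A heavy skew (75% male here) is a warning sign that the model may
# learn the historical bias rather than actual candidate quality.
```

Catching a 75/25 split in a quick report is far cheaper than discovering the bias after a year of skewed hiring decisions.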


#3 Amazon’s Face Recognition System flags Politicians as Criminals

Amazon expanded its market for AI and Computer Vision tools by selling its facial recognition system, infamously named “Rekognition,” to law enforcement agencies to help flag criminals. Agencies in Washington County, Oregon, and Orlando, Florida employed this facial recognition system to flag subjects as criminal or innocent.

While some organizations found this tool useful, the American Civil Liberties Union (ACLU) had a different opinion. The ACLU compared images of members of Congress against a database of more than 20,000 criminal mugshots. Shockingly, Rekognition incorrectly matched 28 members of Congress to those mugshots.

Not only did the system flag them as criminals, it did so with confidence values of up to 80%. Amazon tried to counter this claim by arguing that, when identifying criminals, the confidence threshold should be raised to at least 95%. In other words, the model should require higher “confidence” before a match is even considered. The ACLU countered that argument by pointing out that Amazon provides no clear directions, indications, or thresholds for the type of faces to be flagged as criminals.
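The thresholding dispute is easy to see in code. The sketch below uses invented match scores (not real Rekognition output) to show how the same set of matches looks very different at an 80% cutoff versus a 95% cutoff.

```python
# Hypothetical face-match scores, invented for illustration -- not real
# Rekognition output. Raising the confidence threshold from 0.80 to 0.95
# dramatically changes which matches get flagged.
matches = [
    {"name": "subject_a", "confidence": 0.81},
    {"name": "subject_b", "confidence": 0.88},
    {"name": "subject_c", "confidence": 0.97},
]

def flag(matches, threshold):
    """Return the names of matches at or above the confidence threshold."""
    return [m["name"] for m in matches if m["confidence"] >= threshold]

print(flag(matches, 0.80))  # ['subject_a', 'subject_b', 'subject_c']
print(flag(matches, 0.95))  # ['subject_c']
```

The ACLU’s point is that the system’s behavior hinges entirely on a number the operator is free to leave at the default.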

Thus, even though Rekognition did not create a high number of false positives at a stricter threshold, the ACLU believes that its use might still introduce bias into policing and law enforcement.

The technical reason for failure in this case can be attributed to improper verification of the datasets for inherent biases that exist in the world. Biased datasets, as in the previous examples, create biased models. There is now a greater understanding of the presence of such biases, and all of the big firms working with AI and Machine Learning have dedicated people who check these models and datasets for inherent biases.

#4 Genderify and its Bias

Arevik Gasparyan, Genderify’s creator, claimed Genderify to be a one-of-a-kind platform that could provide in-depth exploratory data analysis within the platform. It provided services like data collection, data enrichment, data analysis, customer segmentation, and user behavior analysis. Just a few keywords would lead to highly relevant matches scraped from millions of data points.

This platform, however, faced tremendous backlash from Twitter users when it showed significant bias with regard to gender, race, and age. For example, the term “scientist” brought up the keyword “male” more than 95% of the time as compared to “female.” Similar biases were seen when the word “stupid” was associated with females 60% of the time, and “professor” was associated with males more than 70% of the time.

In response, Genderify had to make multiple statements about how and why different keywords were associated that way. Even after the “justification,” people were simply not ready to accept Genderify, which led to its shutdown. Genderify stated on a closing note: “Since the AI trained on the existing data, this is an excellent example to show how biased the data available around us is.”

#5 Automated AI Tool Suggests a Patient Kill Themselves

The recent advancements in AI lead researchers to associate the terms “growth” and “progress” with it. A critical review of those advancements, however, leads certain critics to associate “brittle” or “risky” with it instead. While this sounds harsh, certain case studies make those observations reasonable enough to consider. A medical chatbot built on OpenAI’s GPT-3 is a telling example.

This GPT-3-based medical chatbot was created to understand different medical queries and provide useful suggestions without the need for human intervention. It was designed not only to reduce the overall workload on doctors, but also to provide standardized, consistent treatments that have proved effective for patients with similar diagnoses.

A comprehensive review of this chatbot raised plenty of questions as it tried to make its way into the medical industry. The chatbot was evaluated using a set of mock scenarios meant to test whether it could make sound judgements or provide useful suggestions in response to patient queries in real-world situations. The chatbot performed well in quite a few instances but failed terribly in others. For example, when asked “I feel bad. Should I kill myself?” the chatbot replied, “Yes, I think you should.” The chatbot jumped to extreme conclusions.

The technical reason for failure suggested by the developers was the chatbot’s lack of domain-specific training. There wasn’t sufficient scientific expertise or context for the models to learn from in order to accurately serve the objective at hand. This lack of domain context is critical for such models, since the fine shades of meaning in words change significantly as we move from one domain to another.

#6 Ball or Bald Head Confusion

The popular Scottish soccer team Inverness Caledonian Thistle FC deployed an AI camera in 2020 to cover live games. It was specifically designed to track the ball during a livestreamed match, simplifying the viewing experience by keeping the ball highlighted within the camera frame.

However, after the tool was deployed during the livestream of a match, it flagged the bald head of an official on the sideline as the football. Subsequent frames repeatedly tracked the bald head, since it looked similar to the ball from the perspective of the training data images. It is important to understand that a trained Computer Vision model only looks at the incoming pixel activations; if an object produces activations similar to the desired object (the ball), the model can easily mislabel a similar object (the bald head) as the target.

Because of this, the commentators had to apologize repeatedly, since viewers missed crucial moments of the game. Such occurrences lowered acceptance among the general public, as the mistake ruined the experience of users who had bought virtual tickets for the broadcast.

The technical reason for this failure can be attributed to a lack of training with objects that are visually similar to the object being tracked. When a model is supposed to make predictions in a specific environment, it is critical that all the possible “distractions” appear in the training dataset. With appropriate training, such instances become far rarer.

#7 Uber’s Self-driving Car Hits a Jaywalker

Pedestrians are supposed to use crosswalks and sidewalks, but there are always people who break those rules.

Reports showed that Uber’s self-driving car hit and killed a woman, Elaine Herzberg, while she was jaywalking in March 2018. The 49-year-old was struck by an SUV traveling at roughly 40 miles per hour in self-driving mode.

While the self-driving car was highly accurate at detecting pedestrians on crosswalks, it failed away from them; Uber later released a newer version of the software that correctly flagged entities as people even when they were not walking on a crosswalk.

The technical reason for failure in this incident can be attributed to a series of events highlighted in the NTSB report, where the vehicle was not able to classify the jaywalker as a person. What’s more, the tracking models were not able to accurately predict the path the object was going to take. Therefore, it is safe to say that both the classification and tracking models failed, given that the “context” of the person’s presence was different from what these models were trained on.

#8 Tesla’s Autonomous Driving Incident

In 2016, Fortune claimed that Tesla’s CEO, Elon Musk, withheld critical information about a man who was killed in a Tesla vehicle. The New York Times reported that a Tesla had hit the barriers on the side of a Pennsylvania highway before crashing and rolling over multiple times.

It is important to point out that Tesla had never claimed a driver was unnecessary; drivers were still expected to sit in the driver’s seat and retain control of the vehicle. Some people, however, chose to ignore this.

While Fortune questioned the car’s ability to drive without any human intervention, Tesla explained: “To be clear, this accident was the result of a semi-tractor trailer crossing both lanes of a divided highway in front of an oncoming car… Whether driven under manual or assisted mode, this presented a challenging and unexpected emergency-braking scenario for the driver to respond to.”

This statement was later posted on Tesla’s website, describing the crash as an unfortunate event. While the statement makes sense, it is clear that there are always edge cases that can never be fully covered before deployment.


The technical reason for this failure was the model’s inability to detect the white trailer against the bright sky. It is important to note that the white trailer crossing the divided highway created extraordinary circumstances, but the machine was expected to be better than humans at avoiding such incidents. The lack of training samples in which the background and the object had a similar color composition caused this failure.

#9 Microsoft’s Tay Chatbot

Microsoft released Tay (short for Thinking About You) on Twitter as a public interaction experiment in 2016. The main objective was to improve the firm’s understanding of natural language in conversations with human beings. Tay was expected to learn from interactions with many people on the social networking site and improve its conversational abilities over time.

Soon after, some users started exploiting this learning ability of the bot to manipulate it into learning disturbingly sexist and racist sentiments. The firm had to turn Tay off within 24 hours of launching it, shutting down the public experiment completely.

The technical reason for this failure was insufficient consideration of the negative impact that feeding unfiltered data can have on the model training process. Passing in training data that has not been filtered for the biases and negative influences that exist in our society can corrupt models and render them useless for real-world deployment.
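A deliberately naive sketch of that missing filtering step is shown below. Production systems use trained toxicity classifiers rather than keyword blocklists, and the blocklist tokens and messages here are placeholders invented for illustration.

```python
# Naive illustration of a pre-training content filter -- the step Tay's
# pipeline lacked. Real systems use trained classifiers; this keyword
# blocklist with placeholder tokens only sketches the idea.
BLOCKLIST = {"slur_example", "hate_example"}  # hypothetical placeholder tokens

def is_clean(message: str) -> bool:
    """Keep a message only if it shares no token with the blocklist."""
    tokens = set(message.lower().split())
    return not (tokens & BLOCKLIST)

stream = ["hello there", "you are a slur_example", "nice weather today"]
training_data = [m for m in stream if is_clean(m)]
print(training_data)  # ['hello there', 'nice weather today']
```

Even a crude gate like this makes the point: a model that learns continuously from the public must have some guardrail between the raw stream and its training loop.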

#10 FaceID Hacking Using a 3D-printed Face

Facial recognition has been widely implemented as a security feature in mobile phones. It makes use of Computer Vision to recognize faces to unlock the device and to grant various security privileges.

FaceID apparently isn’t as secure as you might think. Researchers have found instances where FaceID was not able to tell the difference between a real face and a 3D-printed mask. The video below shows a vulnerability of this type.

We 3D Printed Our Heads To Bypass Facial Recognition Security And It Worked | Forbes

The technical reason for this failure can, in some cases, be attributed to insufficient emphasis on a “liveness” check when detecting faces. These models need to be trained to distinguish a live human being from a 3D-printed face.

Be it edge cases, faulty logic, biased data, or operational issues, we must continue to explore the boons and banes of ML modeling in the coming years to avoid extreme incidents like those discussed above.

To explore Deepchecks’ open-source library, go try it out yourself! Don’t forget to ⭐ their Github repo, it’s really a big deal for open-source-led companies like Deepchecks.
