This blog post was written by Inderjit Singh as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us via firstname.lastname@example.org. We typically pay a symbolic fee for content that’s accepted by our reviewers.
Failure analysis, in general, is a technical procedure used to investigate the root cause of the failure of a specific product or piece of equipment. A failure can be caused by an unintentional mistake made during the design phase, or by a new, unforeseeable problem that hinders or reduces the ability of the product to execute its intended purpose. In Machine Learning, much like in any other field, it is important to perform failure analysis in order to create a production-capable pipeline that is reliable and robust. Here, we discuss various types of failures that can occur in a Machine Learning pipeline, and some of the open-source implementations available to prevent or mitigate such failures with the least overhead. Let’s first look into the types of failures that can occur in a production pipeline.
Types of Failure
There are different types of failures in ML systems. We will look at the top three types I have encountered in my Machine Learning deployment experiences.
Performance failures are caused by specific subgroups in the dataset that remain hidden when only basic techniques are used. For example, a bias towards specific population minorities in the dataset may not be detected by a basic EDA (exploratory data analysis) process. This can lead to bias against a specific subgroup, or to underfitting on it. The most common cause of performance failure is a poor understanding of the domain combined with an EDA process that was not thorough enough to reveal the subgroups in the underlying dataset. There is significant literature available on the types of bias and their possible remedies. Some of the significant types of bias are listed in the image below:
A good example of performance failure is a personalization model for an e-commerce website where we decide to work with short-term data for personalized recommendations, with the objective of increasing the click-through rate. If our training, validation, and hold-out sets all come from the same small sample (i.e., only the data for a short time window), we might end up with a model that captures only short-term trends in user behavior and reduces engagement in the long run.
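One way to surface the hidden-subgroup problem described in this section is to report a metric per subgroup instead of a single aggregate. The snippet below is a minimal sketch in plain Python, not a method taken from any particular library; the record keys `label`, `prediction`, and `segment`, and the toy data, are assumptions made up for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return {group_value: accuracy} for a list of dict records.

    Each record is assumed to carry 'label' and 'prediction' keys
    (hypothetical names chosen for this sketch).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in records:
        group = row[group_key]
        total[group] += 1
        if row["label"] == row["prediction"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def overall_accuracy(records):
    """Accuracy over all records, ignoring subgroups."""
    hits = sum(1 for row in records if row["label"] == row["prediction"])
    return hits / len(records)


# Toy data: the model is right for all 8 "returning" users
# but only for 1 of the 2 "new" users.
records = (
    [{"segment": "returning", "label": 1, "prediction": 1} for _ in range(8)]
    + [
        {"segment": "new", "label": 1, "prediction": 1},
        {"segment": "new", "label": 1, "prediction": 0},
    ]
)

print(overall_accuracy(records))               # the aggregate looks healthy
print(accuracy_by_group(records, "segment"))   # the minority segment lags behind
```

The same slicing idea applies to time windows: grouping by week or month instead of by user segment would show whether a model trained on a short period degrades on later data.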
The second type of failure is inconsistent behavior of a model, usually without any warning, caused by features that violate validity assumptions. For example, computer vision models are sensitive to the resolution of the incoming images. If proper validation is not built into the pre-processing pipeline, we are likely to get inconsistent results from the very same model that achieved great accuracy during development. In testing environments and development cycles, where we work with a defined set of datasets, we are less likely to encounter inconsistent values because the data comes from a finite range. In production, however, we can receive erroneous inputs (for the CV model, a higher- or lower-resolution image) that cause silent failures and unpredictable results in the training or prediction pipeline. Some simple checks in a data pipeline that help anticipate such failures are:
– Feature values must have a specific range when coming in.
– Ensure that the data pipeline is capable of handling type conversion without affecting the downstream model’s performance.
– Check that the features critical for prediction are present. We can categorize individual features as critical or optional (i.e., features that can be treated as missing, e.g. <unknown> or <unk> tokens in NLP).
Custom validation code of this kind flags erroneous changes in the upstream data ingestion pipeline and helps us create a reliable pipeline for predictions. The image below shows some of the reasons why we have model failures in production:
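The three checks listed above (value ranges, type conversion, and critical versus optional features) can be sketched as a small validation function. This is only an illustration under stated assumptions: the schema format, the feature names, and the resolution bounds are hypothetical and would need to match your own pipeline.

```python
def validate_row(row, schema):
    """Validate and coerce one incoming feature row.

    `schema` maps feature name -> dict with keys (all hypothetical):
      'type'     : callable used for type conversion (e.g. int, str)
      'required' : True if the feature is critical for prediction
      'range'    : optional (low, high) tuple of allowed values, inclusive
      'default'  : value substituted when an optional feature is missing
    Returns (clean_row, errors). A non-empty `errors` list means the row
    should be rejected or inspected instead of being scored silently.
    """
    clean, errors = {}, []
    for name, rule in schema.items():
        value = row.get(name)
        if value is None:
            if rule.get("required", False):
                errors.append(f"missing critical feature: {name}")
            else:
                clean[name] = rule.get("default")
            continue
        try:
            value = rule["type"](value)
        except (TypeError, ValueError):
            errors.append(f"cannot convert {name}={value!r}")
            continue
        bounds = rule.get("range")
        if bounds is not None and not (bounds[0] <= value <= bounds[1]):
            errors.append(f"{name}={value!r} outside range {bounds}")
            continue
        clean[name] = value
    return clean, errors


# Hypothetical schema for a vision model that expects bounded resolutions.
SCHEMA = {
    "image_width": {"type": int, "required": True, "range": (224, 1024)},
    "image_height": {"type": int, "required": True, "range": (224, 1024)},
    "caption": {"type": str, "required": False, "default": "<unk>"},
}

# A valid row: the string "640" is converted, the missing caption gets "<unk>".
ok_row, ok_errors = validate_row({"image_width": "640", "image_height": 480}, SCHEMA)

# An invalid row: width out of range and height missing entirely.
bad_row, bad_errors = validate_row({"image_width": 64}, SCHEMA)

print(ok_row, ok_errors)
print(bad_row, bad_errors)
```

Returning the errors rather than raising on the first one is a design choice made here so that a single log entry can describe everything wrong with a row.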
The third type is a failure of robustness: the inability of the model to withstand perturbations and small changes in the input features (e.g. blur in vision-based implementations). A robust model holds up in the face of such changes and continues to give consistent results.
For example, in a CNN (Convolutional Neural Network), the introduction of first-layer activations into the image pixels of a different class can cause some models to give completely different results. The following image shows how a small perturbation of the input pixels can cause a significant deviation in the model’s outcome without any visible difference to the human eye. In contrast to the model, a human observer is able to ignore such differences.
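A simple way to probe robustness is to add small random noise to each input and count how often the prediction changes. The sketch below uses only the Python standard library. `toy_predict` is a hypothetical stand-in for a real model, and random Gaussian noise is a much weaker test than a targeted adversarial perturbation, so treat this as a smoke test rather than a robustness guarantee.

```python
import random


def flip_rate(predict, inputs, noise_scale, trials=20, seed=0):
    """Fraction of noisy predictions that differ from the clean prediction."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    flips = 0
    for features in inputs:
        baseline = predict(features)
        for _ in range(trials):
            noisy = [x + rng.gauss(0.0, noise_scale) for x in features]
            if predict(noisy) != baseline:
                flips += 1
    return flips / (len(inputs) * trials)


def toy_predict(features):
    """Hypothetical stand-in model: class 1 if the mean feature exceeds 0.5."""
    return int(sum(features) / len(features) > 0.5)


# Inputs far from the decision boundary should be stable under small noise;
# inputs close to it are where predictions start to flip.
far_from_boundary = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
near_boundary = [[0.5, 0.5, 0.51], [0.5, 0.5, 0.49]]

far_rate = flip_rate(toy_predict, far_from_boundary, noise_scale=0.01)
near_rate = flip_rate(toy_predict, near_boundary, noise_scale=0.01)

print(far_rate)
print(near_rate)
```

In practice, `predict` would wrap the deployed model, and the noise would be replaced by perturbations that make sense for the domain, such as blur or resizing for images.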
Mitigation and Preventive Steps
Now that we are familiar with the extent and impact of the problem, let’s look at some of the open-source tools and methodologies that can help mitigate the negative impact of failures in a production pipeline. We will go through Deepchecks open-source tools to generate failure data for our Machine Learning models.
The Deepchecks open-source model validation tool is easy to use and integrates with popular, readily available modeling libraries such as scikit-learn. It supports validation of business logic through features such as:
– Feature Importance Score
– Receiver Operating Characteristic (ROC) Reports
– Metric Plots
– Drift Analysis
– Data Validation
– Feature Contribution Analysis
The most relevant feature for our discussion is the out-of-the-box validation that allows anyone to create custom validations with minimal coding. We can create custom pipelines with customized validations. The code snippet further below runs the full suite, which creates validations for data distribution, data integrity, methodology, and performance.
Deepchecks is easy to install and can be added with a single command:
pip install deepchecks
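# In recent Deepchecks releases the tabular API lives under deepchecks.tabular;
# older releases exposed Dataset and the suites from the top-level package.
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import full_suite

# Join features and labels into a single DataFrame per split
ds_train = X_train.merge(y_train, left_index=True, right_index=True)
ds_test = X_test.merge(y_test, left_index=True, right_index=True)

# Wrap the DataFrames so Deepchecks knows the label and the categorical features
ds_train = Dataset(ds_train, label="WasTheLoanApproved", cat_features=["TypeOfCurrentEmployment", "LoanReason", "DebtsPaid"])
ds_test = Dataset(ds_test, label="WasTheLoanApproved", cat_features=["TypeOfCurrentEmployment", "LoanReason", "DebtsPaid"])

pipeline = full_suite()
# clf is any scikit-learn classifier
pipeline.run(train_dataset=ds_train, test_dataset=ds_test, model=clf)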
Microsoft’s Responsible AI Toolbox
Another good open-source framework available to Machine Learning practitioners is Microsoft’s Responsible AI Toolbox. It contains several dashboards that measure a number of model parameters with a small amount of code. These dashboards let users examine different components of model failure comprehensively: we can look at the fairness of the model, perform error analysis, and interpret the model outcomes. The figure below shows some of the available dashboards:
The module can be easily installed with pip using:
pip install raiwidgets
It is actively maintained, with contributions from the community. There might be some limitations as we progress to deep learning-based solutions, but it works well for most of the readily available Machine Learning implementations. The code snippet below shows an example of using the widget with a simple scikit-learn model:
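from raiwidgets import ResponsibleAIDashboard
from responsibleai import RAIInsights

task_type = 'regression'
# model, train_data, test_data and target_feature come from your own training code
rai_insights = RAIInsights(model, train_data, test_data, target_feature, task_type)

# Register the analyses to run
rai_insights.explainer.add()
rai_insights.error_analysis.add()
# specification of the feature that would be changed as a treatment
rai_insights.causal.add(treatment_features=['bmi', 'bp', 's2'])

# Compute the registered analyses and display them in the dashboard widget
rai_insights.compute()
ResponsibleAIDashboard(rai_insights)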
The Responsible AI Toolbox produces great-looking dashboards that are easy to understand, thanks to their visualizations and intuitive design. A complete notebook example for the use of the widget is accessible here. The example below shows one of the insightful dashboards produced from Error Analysis:
Conclusion
Although we discussed several reasons for failure in machine learning models, we must not restrict ourselves to these categories when looking for the causes of failures. Consider looking into Microsoft’s comprehensive documentation on failure modes. In the examples, we saw that some of these failures creep in from data, some are human-induced, and some are caused by a mere lack of domain knowledge. The use of open-source tools (some of which we discussed above) can help mitigate these by automating failure analysis diagnostics and can help prevent catastrophic failures in a production environment. These tools also automate the grunt work, so a Machine Learning engineer can focus on creating an optimal model and maximizing the real value gained from integrating machine learning.