How to Monitor Open-source ML Models

This blog post was written by Tonye Harry as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us. We typically pay a symbolic fee for content that’s accepted by our reviewers.


Open-source machine learning (ML) models have become increasingly popular as the principles of open source facilitate collaboration, transparency, and the democratization of knowledge. These models play a vital role in informing critical business decisions for organizations. In fact, a leaked Google document recently acknowledged that open-source models have become a cause for concern in the AI race with companies like OpenAI. These models are not only faster but also more customizable and capable. Moreover, they require fewer training resources than proprietary models with billions of parameters, thereby lowering the barrier to entry.

Popular open-source ML projects you may recognize include ResNet-50, AlexNet, VGG-16, Hugging Face Transformers models like BERT, and foundation models like Generative Pre-trained Transformers (GPT). More recent ones include Stable Diffusion and Alpaca, among others.

They exist for various use cases, but when you use one in your own project, it is important to monitor it to ensure it generalizes well against the ground truth.

This article will cover:

  • Setting up a monitoring system
  • Types of monitoring
  • How and what to monitor
  • Importance of monitoring open source models
  • Future trends in this space.

Setting up a Monitoring System

Suppose you have used transfer learning: you took a foundation model and fine-tuned it on your own text data to achieve optimal performance in a Natural Language Processing (NLP) use case. Without proper monitoring, the model could deviate from its intended context, resulting in diminished coherence, fluency, and overall performance of the ML system.

This also holds true for various other use cases, and failing to establish a robust monitoring framework can lead to increased costs for your team. Therefore, it becomes imperative to prioritize the implementation of effective model monitoring practices to safeguard the integrity and efficacy of your ML models.

Defining monitoring goals and metrics

When considering your project, it is essential to anticipate potential pitfalls that may arise after deploying your fine-tuned open-source model. In order to mitigate these risks, your monitoring goals should revolve around the following key aspects:

  • Safeguarding Model Performance: Protect your model from degradation by ensuring its accuracy and effectiveness remain intact.
  • Error Detection and Troubleshooting: Detect errors promptly and notify your team to facilitate troubleshooting. This allows for data updates and model retraining to enhance performance.
  • Optimizing Resource Utilization: Ensure optimal resource allocation based on your team’s system requirements and budgetary considerations.
  • Sustaining Pipeline Integrity: Maintain the functionality and efficiency of the various pipelines that have been established.
  • Evaluating Business Impact: Assess the business value and impact of your model’s predictions, gauging its contribution to the overall success of your project.

With these goals in mind, you can proactively address potential challenges, enhance model performance, and maximize the benefits derived from your deployed open-source model.

Monitoring Metrics

To determine which metrics to use, it is important to establish a monitoring framework that aligns with the specific project requirements. This framework should allow you to select the most relevant metrics to track based on the desired measurement levels.

Monitoring metrics for classification or regression tasks using models like ResNet-50 or BERT is relatively straightforward. However, evaluating generative language or visual models presents additional complexities, as the output can vary not only by the task but also by the domain in which it is deployed.

At this stage, it becomes crucial to prioritize the metrics that align with your objectives. In the case of Natural Language Generation (NLG) models, relying solely on accuracy as a metric may not provide a complete picture of the model’s performance.

The considerations for selecting metrics differ by task type, for example classification, regression, or generation. Metrics like accuracy, precision, F1-score, and mean squared error (MSE) are common for classification and regression tasks, but selecting metrics for generative foundation models poses several challenges, which include:

  • Multiple Variations of Effective Input Prompts: With generative models, finding the most effective input prompts can be challenging. Different input prompts may yield varied outputs, making it important to consider the range of inputs during evaluation.
  • Ambiguous Output Format: The format of the model’s output may not be well-defined or straightforward. It could contain ambiguities or require additional interpretation, posing challenges in determining the quality and appropriateness of the generated output.
  • Multiple Measures for Output: There may be multiple dimensions to consider when evaluating the output, such as confidence level (truthfulness), fluency, toxicity, bias, etc. It is essential to carefully consider and prioritize the most relevant measures for the specific application and domain.

You should think through your task and investigate cost-effective ways to measure model performance. To get a better sense of which metrics to prioritize, here are a few questions you can go through with your team.

Table 1 below shows a list of potential questions you should ask in order to properly monitor different aspects of your open-source model.

Reason / Key Questions
Performance degradation
  • Is the model’s accuracy decreasing over time?
  • Is there an increased number of prediction errors?
  • Is the model meeting the desired performance metrics?
Data drift detection
  • Is the distribution of input data changing significantly?
  • Are there any shifts in the statistical properties of the data?
  • Is the model still generalizing well to new data?
Concept shift identification
  • Are the relationships between input features and target outputs changing?
  • Are there external factors or system updates that may impact the model’s performance?
  • Do adjustments or retraining need to be performed to accommodate the concept shift?
Model fairness and bias detection
  • Are there any biases in the model’s predictions?
  • Are certain demographic groups disproportionately affected by the model’s decisions?
  • Are fairness metrics like disparate impact or equal opportunity being violated?
  • How toxic is the generated text or image output?
Infrastructure and resource optimization
  • Is the model utilizing resources efficiently?
  • Are there any bottlenecks or scalability issues in the system?
  • Is the response time satisfactory?
Compliance with regulations and standards
  • Does the model adhere to legal and regulatory requirements?
  • Is user privacy adequately protected?
  • Does the model comply with regulations and industry-specific standards, such as GDPR?
Model interpretability and explainability
  • Can the model’s decisions be explained and understood?
  • Are there methods in place to interpret the model’s predictions?
  • Are influential features identified and properly communicated?
Early anomaly detection and incident response
  • Are there any anomalies, outliers, or errors in real-time predictions?
  • Is there a mechanism to receive alerts and notifications for prompt incident response?
  • Is the system resilient to potential issues, ensuring minimal downtime and user impact?
Continuous model improvement and adaptation
  • Are there areas for improvement based on ongoing monitoring?
  • Is user feedback being collected and incorporated into model updates?
  • Is there a process for iteratively updating and retraining the model?
Business and financial impact evaluation
  • What is the business impact of the model’s predictions?
  • Are the desired outcomes being achieved?
  • Is the model delivering a positive return on investment?
Auditability, compliance, and accountability
  • Can the model’s decisions be audited and traced back?
  • Are there records of predictions, input data, and model behavior?
  • Does the monitoring process comply with regulatory requirements?
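
The data drift questions in Table 1 can be turned into a number you track over time. The sketch below is one possible approach, not a prescribed method: it computes the Population Stability Index (PSI) between a reference sample (for example, a feature at training time) and a current sample. The function name, the equal-width binning, and the default of 10 bins are assumptions made for illustration.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Return the PSI between a reference sample and a current sample.

    Bins are equal-width over the reference range; current values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant data

    def bin_fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))
```

A commonly cited rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but you should treat such thresholds as starting points and tune them for your own data.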

Model Performance Monitoring

Fig. 1: Image showing deepchecks library capabilities Source

Open-source models such as ResNet-50 and XGBoost are widely utilized for various tasks, including classification, segmentation, regression, and feature extraction. These models can be effectively monitored using modern MLOps tools like Deepchecks Hub, which offer convenient features for tracking their performance.

Performance evaluation of these models typically involves assessing metrics such as accuracy, F1 score, precision, and RMSE (Root Mean Square Error). These metrics provide valuable insights into the model’s effectiveness and can be easily incorporated into the monitoring process.
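
To make these metrics concrete, here is a minimal, dependency-free sketch of how accuracy, precision, recall, F1, and RMSE can be computed from logged predictions. The function names are illustrative, and the classification function assumes binary labels where 1 is the positive class; in practice you would likely rely on an established metrics library.

```python
import math
from typing import Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Square Error for regression outputs."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

Computing these values on every labeled batch and storing them with a timestamp gives you the time series you need to spot degradation.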

When dealing with contemporary foundation models like LLaMA, which is known for its intricacies, additional evaluation metrics become necessary to establish a comprehensive performance monitoring framework. These metrics go beyond traditional measures and are specifically designed to capture the nuances and complexities of large language models (LLMs).

To effectively monitor both traditional and foundation open-source models, it is essential to focus on two key areas:

  • Data and Prediction Monitoring
  • Model Monitoring

Data and Prediction Monitoring

Traditional monitoring techniques, such as tracking metrics and evaluating accuracy, are not always effective for foundation models. This is because foundation models are trained on such large datasets that it can be difficult to identify patterns in the data that could indicate a problem.

Because LLMs have become so popular and widely used in recent times, this article focuses on monitoring large language foundation models. Traditional open-source models are comparatively easy to monitor, whereas general-purpose LLMs need different levels of monitoring to manage the model’s performance effectively. If you want to monitor traditional models only, here is a comprehensive guide.

Writing instructions in natural language (prompt engineering) is inherently more flexible but also challenging. This flexibility arises due to the user-defined nature of the instructions, which can vary across individuals and cultures. In contrast, programming languages offer higher precision and exactness in their instructions.

To monitor the data input for models such as GPT or diffusion models, consider a more human-centric approach in which you evaluate, version, and optimize your prompts:


Evaluate

To evaluate your input data and its results, you can use prompt engineering to provide your model with varied examples that test its ability to generalize from them.

  • User Feedback: Collect feedback from users who interact with the prompts and generated outputs. Conduct surveys or interviews to understand their satisfaction and whether the prompts effectively achieve their intended goals.

OpenAI does this subtly with ChatGPT to improve its responses: you can give feedback by clicking to indicate whether you like the output you get from the generative AI.

  • Compare Variations: Compare different prompts or variations in terms of their impact on output quality.

For example, compare prompts with different levels of specificity or different contexts to see which ones produce more coherent and relevant outputs.

The idea behind evaluating the data inputs is to find out whether the language model understands the prompt examples provided and whether it overfits to the examples you have given.


Version

Utilize tools like Git to manage prompt versions, allowing you to easily compare different versions, revert changes if necessary, and collaborate with others on prompt development. Keep in mind that slight changes to prompts can lead to different results.
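
Alongside Git, it can help to log a short fingerprint of the exact prompt text with every model output, so that you can later tell which prompt version produced which result. The following is a small sketch using only the Python standard library; the helper names and the two example prompts are hypothetical.

```python
import difflib
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """Short, stable identifier for an exact prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def prompt_diff(old: str, new: str) -> list[str]:
    """Line-by-line unified diff between two prompt versions."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="prompt_v1",
            tofile="prompt_v2",
            lineterm="",
        )
    )


# Two hypothetical versions of the same prompt template
v1 = "Summarize the review below in one sentence.\nReview: {review}"
v2 = "Summarize the review below in one neutral sentence.\nReview: {review}"

for line in prompt_diff(v1, v2):
    print(line)
```

Because even a one-word change yields a different fingerprint, storing it next to each logged output makes it easy to group results by prompt version when you compare them.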


Optimize

  • Experiment with Modifications: Modify prompts by adjusting wording, providing additional context, or specifying constraints to influence the output. For instance, experiment with asking questions in different ways, incorporating specific keywords, or breaking prompts down into simpler ones.
  • A/B Testing: Generate different variations of the prompt’s output and ask the LLM to vote for the best one (self-consistency). Compare the output quality, user satisfaction, or other relevant metrics to determine the most effective prompt variations.

The Chain-of-Thought (CoT) technique can also be applied to have the model explain how it arrives at its answers to your prompts.

Fig 2: Image showing a Chain-of-Thought prompting example. Source

  • Iterate and Refine: Based on the insights gained from evaluation, user feedback, and experimentation, iteratively refine prompts. Incorporate successful modifications into newer versions of the prompts, discarding or further optimizing less effective ones.
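
The voting idea mentioned under A/B Testing can be sketched in a few lines. The example below assumes you have already sampled several answers to the same prompt (the model call itself is not shown) and simply picks the most frequent answer after light normalization, which is the majority-vote step of self-consistency. The function name is illustrative.

```python
from collections import Counter
from typing import Sequence


def majority_answer(answers: Sequence[str]) -> tuple[str, float]:
    """Return the most common normalized answer and its share of the votes."""
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)
```

The vote share is worth logging too: a low share suggests the model is unsure or the prompt is ambiguous, which is a useful signal when you decide which prompt variations to keep.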

Model Monitoring

This encompasses tracking the model’s performance with the help of metrics and the model’s output quality. By monitoring these, any issues or changes in the model’s performance can be identified early, allowing for timely intervention and improvement.
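
One simple way to identify such issues early is to track accuracy over a sliding window of recent labeled predictions and raise a flag when it drops below a threshold. The class below is a minimal sketch of that idea; the class name, window size, and threshold are assumptions you would adapt to your own traffic and service levels.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window: int = 100, threshold: float = 0.9):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, label) -> None:
        """Store whether a single prediction matched its ground-truth label."""
        self.outcomes.append(prediction == label)

    @property
    def accuracy(self) -> float:
        if not self.outcomes:
            return 1.0  # no evidence of a problem yet
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noisy early alerts
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy < self.threshold
```

In a real deployment, `should_alert()` would feed whatever notification channel your team already uses instead of being checked by hand.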

Deepchecks Hub helps you do this effortlessly for computer vision and NLP models, although you can alternatively use MLOps platforms that cater specifically to your use case. These platforms enable seamless collaboration and alerting feedback loops, so you can immediately troubleshoot anything that goes wrong, such as model or data drift.

Since accuracy doesn’t capture everything, Stanford researchers developed the Holistic Evaluation of Language Models (HELM) to effectively benchmark LLMs.

It has multi-metric measurements that go beyond isolated metrics like accuracy. These include:

Accuracy: The model’s ability to make correct predictions compared to the ground truth labels.

Calibration: The alignment between the predicted and true probabilities of the model’s predictions. In other words, what’s the confidence level of the model? A well-calibrated model provides reliable probability estimates.

Robustness: The model’s resilience to variations or perturbations in the input data. Robust models exhibit consistent performance across different scenarios or data distributions.

Fairness: The assessment of whether the model’s predictions exhibit bias or discrimination against certain groups or protected attributes. Fairness metrics help identify and mitigate any unjust or biased behavior.

Bias: The identification and mitigation of bias in the model’s predictions or decision-making processes. Bias metrics aim to uncover and address any systematic disparities in outcomes.

Toxicity: The measurement of potentially harmful or offensive content generated by the model. Toxicity metrics assist in detecting and mitigating the generation of inappropriate or undesirable outputs.

Efficiency: The evaluation of the model’s computational efficiency, such as inference speed, memory usage, or energy consumption. Efficient models optimize resource utilization and can be deployed in real-time or resource-constrained environments.
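
Of the measurements above, calibration is one that is easy to compute yourself once you log a confidence score with each prediction. The sketch below computes the Expected Calibration Error (ECE), a widely used summary of calibration, with equal-width confidence bins. It is an illustration rather than HELM’s own implementation, and the function name and default of 10 bins are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece
```

A model that reports 90% confidence and is right 90% of the time scores close to 0, while the same confidence with only 50% accuracy scores about 0.4.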

Metrics and Libraries for LLM Monitoring

Use metrics such as BLEU score or ROUGE score to measure how similar generated text is to reference texts, or CLIP score (for computer vision models) to measure how well a generated image matches its text prompt. Note that the right choice depends on the task, scenario, and domain, so you can pick from a plethora of evaluation metrics as long as they meet your requirements.
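
To show what a ROUGE-style score measures, here is a simplified ROUGE-1 F1 based on unigram overlap. It is a sketch for intuition only: it tokenizes on whitespace and skips the stemming and other preprocessing that full implementations typically apply, so its numbers will not match those of established packages exactly.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference text and a generated text."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection clips each word to its minimum count in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

For production monitoring you would normally use a maintained implementation, but a small function like this is handy for sanity-checking what the score rewards and penalizes.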

The diagram below shows the different types of evaluation metrics used to ensure that LLMs work as expected.

Fig 3: This figure shows the different types of evaluation metrics for foundation LLMs. Source

Libraries like Evals and FlagEval attempt to evaluate LLMs, providing valuable insights and performance metrics for model assessment and comparison. This is an emerging field, and evaluation tools will improve with time.


Infrastructure Monitoring

In infrastructure monitoring, you monitor the operation of both the hardware and the software dependencies used to train the model and to serve it to end users after deployment. To do this effectively, consider monitoring the following:

  • System performance and reliability
  • Pipelines (Data and Model)
  • Cost and Latency

System performance and reliability

In infrastructure monitoring, system performance and reliability refer to tracking and analyzing the behavior and stability of the underlying hardware and software components that support machine learning workflows. This includes monitoring various system-level metrics such as:

  • CPU and memory utilization
    Example: Is the CPU or memory utilized at an optimal rate?
  • Network throughput
    Example: Is there network congestion limiting your model from transferring data efficiently?
  • I/O
    Example: Number of reads and writes per second
  • Response times
    Example: Response times of an API endpoint serving predictions

This enables you to identify potential bottlenecks, resource constraints, or anomalies to ensure optimal system performance. It allows your MLOps team to take proactive measures such as scaling up resources or optimizing the model to improve overall system performance.
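
Response times are the easiest of these metrics to start collecting in code. The sketch below times calls to a prediction function and summarizes them as percentiles, which tend to reveal bottlenecks better than an average. `predict` stands in for your own model or API call, and both function names are hypothetical.

```python
import time
from statistics import quantiles
from typing import Callable, Sequence


def measure_latencies(predict: Callable, inputs: Sequence) -> list[float]:
    """Call `predict` on each input and return per-call latency in ms."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def latency_summary(latencies_ms: Sequence[float]) -> dict:
    """Median, 95th percentile, and maximum latency (needs >= 2 samples)."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "max": max(latencies_ms)}
```

CPU, memory, network, and I/O metrics usually come from your infrastructure tooling rather than from application code, so they are not shown here.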

Pipelines (Data and Model)

Monitoring pipelines involves tracking the flow and quality of both data and models throughout the entire machine-learning lifecycle. This includes monitoring:

  • Data ingestion
    Example: Monitoring the speed and accuracy of data ingestion processes, ensuring that data is successfully captured and integrated into the system.
  • Preprocessing
    Example: Tracking the efficiency and correctness of data preprocessing steps, such as data cleaning, normalization, or transformation, to ensure high-quality input for the machine learning pipeline.
  • Feature engineering
    Example: Assessing the impact of engineered features on model performance.
  • Model training
    Example: Tracking the progress and resource utilization during model training, including metrics like loss, accuracy, and convergence rate, to ensure successful and efficient training.
  • Evaluation
    Example: Monitoring the performance of the trained model on validation or test datasets, assessing metrics such as precision, recall, F1 score, or area under the curve (AUC) to measure its effectiveness.
  • Deployment
    Example: Monitoring the health and stability of the deployed model, including tracking prediction latencies, error rates, or resource consumption, to ensure reliable and efficient model serving in production.

Monitoring data pipelines ensures data is ingested correctly, transformations are applied accurately, and potential issues like data drift or missing values are identified.

Monitoring model pipelines involves tracking model training progress, performance metrics, model versioning, and model deployment health.
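
As an example of a data pipeline check, the sketch below computes the share of missing values per required field in an incoming batch and reports the fields that exceed a limit. The function name, the record format (a list of dictionaries), and the 5% default limit are assumptions for illustration.

```python
from typing import Any, Mapping, Sequence


def validate_batch(
    rows: Sequence[Mapping[str, Any]],
    required_fields: Sequence[str],
    max_missing_rate: float = 0.05,
) -> dict:
    """Return missing-value rates per field and the fields over the limit."""
    missing_rate = {}
    for field in required_fields:
        # A field counts as missing if it is absent or explicitly None
        missing = sum(1 for row in rows if row.get(field) is None)
        missing_rate[field] = missing / len(rows) if rows else 1.0
    failed = [f for f, rate in missing_rate.items() if rate > max_missing_rate]
    return {"missing_rate": missing_rate, "failed_fields": failed}
```

Running a check like this right after ingestion lets you stop a bad batch before it reaches preprocessing or training.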

Cost and latency


The cost of infrastructure monitoring for traditional models depends on the pricing model of the chosen infrastructure provider. For example, cloud service providers may charge based on the number of instances, storage capacity, or network bandwidth used for monitoring. These can be less expensive because of the wide range of MLOps services available, especially open-source ones.

On the other hand, infrastructure monitoring for foundation models can be more resource-intensive and potentially more expensive. Foundation models often require substantial computational resources for training and fine-tuning. Monitoring the infrastructure for such models involves tracking the utilization of high-performance computing resources, which can come with higher costs than traditional models.


The latency for infrastructure monitoring of traditional models depends on the frequency of data collection and the efficiency of the monitoring system. Monitoring systems may introduce a slight overhead in terms of CPU and memory usage, which can impact the overall latency of the system. However, this latency is usually negligible compared to the primary workload of the model itself.

Monitoring the infrastructure for foundation models may introduce additional latency due to the computational demands of the monitoring system. The monitoring system needs to collect and process large amounts of data from the foundation models, which can add to the overall latency. The latency impact can vary depending on the specific monitoring techniques used and the scale of the foundation model being monitored.

Table 2 shows the differences between traditional and foundation open-source models across a variety of key factors.

| Factors              | Traditional Models | Foundation Models  |
|----------------------|--------------------|--------------------|
| Model Complexity     | Moderate           | High               |
| Resource Utilization | Lower              | Higher             |
| Training Cost        | Lower              | Higher             |
| Inference Latency    | Varies             | Potentially Higher |
| Monitoring Cost      | Lower              | Potentially Higher |
| Model Performance    | Depends on task    | State-of-the-art   |
| Data Requirements    | Moderate           | High               |
| Interpretability     | Easier             | More Challenging   |
| Token Training Cost  | N/A                | Significant        |

Additionally, the cost of the tokens used to train large language models (LLMs) is a significant consideration: it affects the overall training budget and calls for careful evaluation and budgeting in LLM training projects. Open-source LLMs are becoming cheaper to train and more powerful over time, meaning that it takes less money per token to run more experiments or iterations.
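
If your provider bills fine-tuning by the token, a back-of-the-envelope estimate can be written as a one-line function. This is only a sketch: the function name is hypothetical, the price in the usage example is made up, and real budgets also include compute, storage, and evaluation costs that are not modeled here.

```python
def estimate_token_cost(
    n_tokens: int, price_per_million: float, epochs: int = 1
) -> float:
    """Estimated cost of processing `n_tokens` for a number of epochs."""
    return n_tokens / 1_000_000 * price_per_million * epochs


# Hypothetical figures: 2M training tokens, 8.00 per million tokens, 3 epochs
budget = estimate_token_cost(2_000_000, 8.0, epochs=3)
```

Plugging in your own dataset size and your provider’s actual rates gives a quick lower bound to compare against your experimentation budget.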

Future Trends

The future of open-source ML monitoring looks promising. As ML models become more complex and sophisticated, the need for effective monitoring will only increase. In the future, we can expect to see the following trends in open-source ML monitoring:

  • Increased use of automated monitoring tools.
  • More focus on model explainability.
  • Greater collaboration between ML practitioners and data scientists.
  • Development of new standards for ML monitoring.

These trends will help to ensure that open-source ML models are more reliable, accurate, calibrated properly, robust, secure, and almost free from bias as new techniques to monitor foundation models are developed.

Conclusion


Monitoring open-source ML models is not just a matter of good practice; it’s a necessity for ensuring optimal performance, reliability, and compliance. As the adoption of open-source models continues to surge, organizations must embrace effective monitoring strategies to safeguard their investments and gain a competitive edge.

By setting up a comprehensive monitoring system, defining clear goals and metrics, and utilizing advanced evaluation techniques, organizations can stay ahead of potential pitfalls and maximize the value of their models.

The future of monitoring open-source ML models holds even more exciting possibilities. Advancements in holistic evaluation methods like HELM and the emergence of MLOps tools designed specifically for foundation models promise to enhance the monitoring process further. Organizations that embrace these trends will be well-positioned to unlock the full potential of open-source models and drive innovation in their respective fields.

In a rapidly evolving landscape where open-source models are at the forefront of AI advancements, monitoring is no longer an option; it’s an essential ingredient for success.

