Introduction
Large Language Models (LLMs) like ChatGPT and GPT-4 have transformed the AI industry, moving it from simple predictive tools to state-of-the-art solutions. They augment human capabilities by automating processes, saving time and money, enhancing personalization, and enabling informed decision-making.
Their popularity is evident from a recent survey suggesting that over 50% of data scientists plan to deploy LLMs in production within the coming year.
However, assessing the viability of LLMs for particular use cases requires robust monitoring and evaluation techniques.
As such, this post will discuss the significance and challenges of LLM monitoring, describe the LLM application lifecycle, recommend a framework for evaluating applications of LLMs in each lifecycle stage, and discuss best practices to build a practical LLM evaluation framework.
Importance of Monitoring & Evaluating LLMs
Although LLMs show great accuracy when solving general language problems, they still require robust evaluation to perform similarly well on domain-specific tasks. For instance, GPT-3 may be excellent at answering general questions about everyday life. However, a chatbot that uses GPT-3 to answer customer queries about products in a particular retail store may fail to meet expectations.
As such, monitoring LLM-based applications is crucial for the following reasons.
- Performance Improvement: Monitoring LLM applications in real-time helps businesses quickly identify issues such as high latency, slow response times, capacity constraints when processing many queries, and service downtime. Dedicated teams can use this information to make the system more efficient and ensure the application performs well along these dimensions.
- Error Detection: Regular monitoring and evaluation allow experts to identify critical errors such as incorrect answers, inappropriate language, hallucinations (where the LLM produces fabricated or nonsensical responses), and biased information. These errors can severely degrade performance for applications like customer service chatbots and content generation assistants, where accuracy and relevance are paramount. Monitoring response logs can help businesses catch and correct such issues.
- Resource Utilization: Monitoring usage statistics allows companies to evaluate how LLM applications utilize computing power and helps them allocate resources more efficiently to address traffic spikes. It also lets them get insights into how customers use the application and paves the way for further enhancements to ensure it gives customers what they want.
- Model Drift Detection: The data used to train and fine-tune an LLM application can drift away from real-world data over time. For instance, an LLM-based chatbot may give the same stock responses to a variety of novel questions from new customers. Such behavior calls for retraining the model on new data so it adapts to new information. Monitoring LLM behavior through customer feedback and reviews allows businesses to identify potential model drift promptly and proactively curate up-to-date training data.
- Scalability: Constant monitoring helps companies identify workload patterns and use insights from historical data to plan for future upgrades. It helps optimize the infrastructure more efficiently to meet rising demands and minimize service downtime.
- Regulatory Compliance: Ensuring LLMs provide unbiased responses to all user queries and use appropriate language to address sensitive issues is significantly challenging. Organizations must establish robust safeguards and protocols to prevent security breaches and continually monitor LLMs to identify harmful responses, misinformation, and jailbreaks.
- Development and Inference Cost Optimization: Monitoring LLM-based applications allows businesses to economize operating and maintenance costs. It helps them identify cost-reduction opportunities and build efficient data storage systems for faster inference during production.
Challenges in Monitoring & Evaluating LLMs
While monitoring and evaluating LLMs is highly significant, organizations face several challenges when devising a comprehensive LLM evaluation framework. The section below mentions a few prominent LLM evaluation issues.
No Single Ground Truth
LLMs are trained on vast unstructured datasets and can generate many different plausible responses to the same question. As such, it's challenging to establish whether a particular answer to a user's query is correct.
For example, asking a language model to describe an image showing a cat and a table can generate multiple responses.
“A cat and a table” and “There’s a table with a cat looking over it” both represent what’s in the image. But how should one judge which answer is more accurate?
In addition, the responses rely heavily on prompt quality. A user who gives a highly ambiguous prompt but expects a clear-cut answer may end up disappointed and leave a negative rating. But can we conclude from that feedback that the LLM isn't working well and start retraining the application from scratch?
The examples above demonstrate the difficulty of developing an objective evaluation strategy. An alternative would be to compare the responses to several human-generated responses. However, such a technique is infeasible since there can be thousands of different ways of answering a single question. Deciding which one to use as ground truth can lead to biased training.
Nascent Tools
Limited tools are available for evaluating LLM applications, making the task more difficult as organizations must develop strategies in-house to assess their application quality.
Also, LLM applications are domain-specific, requiring specialized evaluation techniques instead of standardized approaches that automated tools may provide. Finding a tailor-made or customizable evaluation solution is challenging.
Automated Metrics Are Limited in Scope
Automated evaluation metrics such as BLEU, ROUGE, and MAUVE do not account for context when assessing response quality.
Most metrics give their judgments by comparing the output with human-generated responses, assigning high scores to models that match perfectly with specific human-level responses.
Such an approach provides a narrow way to measure quality since it disregards user intent, context, and the wide variability in acceptable ground truths.
The LLM Application Development Lifecycle
Although LLM evaluation is tricky, organizations can still devise sound methods by referring to the LLM application lifecycle. The lifecycle stages mentioned below can help businesses develop monitoring and evaluation techniques tailored to each stage for better results.
- The Pre-Training Stage: Foundation LLMs use large amounts of generic data for training. In this stage, an LLM must learn common patterns in human language and generate human-like responses.
- The Reinforcement Stage: AI experts can introduce a reward-and-punishment model during training. Such a model guides the LLM toward generating the desired output based on feedback: the LLM is rewarded when its output matches the feedback and penalized when it doesn't.
- The Fine-Tuning Stage: Next, developers fine-tune the LLM by training it on domain-specific data. The stage ensures the model learns to perform specific tasks more accurately.
- The Integration Stage: Several other systems, such as databases, application programming interfaces (APIs), and platforms, integrate with the LLM platform for enhanced functionality.
- The Testing Stage: Experts test the LLM application by giving it several prompts and observing response quality.
- The Production Stage: Finally, developers send the application to production and monitor real-time usage to identify issues.
How to Evaluate an LLM at Each Stage of Its Lifecycle
AI practitioners can start by identifying what they must monitor to evaluate LLMs. Below are a few standard dimensions.
- Toxicity: A toxicity metric observes the type of language used in user prompts and LLM responses. Suitable safeguards must be in place to ensure offensive and inappropriate output never reaches users.
- Hallucinations: An LLM hallucinates when it gives entirely incorrect and nonsensical responses, detached from reality. Businesses must develop metrics that can tell when LLMs are hallucinating.
- Jailbreaking Prompts: A jailbreak occurs when a user prompt releases the LLM from its security constraints. For instance, a user can give a DAN (do anything now) prompt, which asks the LLM to pretend to be someone who disregards security protocols and gives completely unrestricted answers. Constant monitoring and prompt logging are necessary to identify such jailbreaks.
- Relevance: Developers must create metrics that tell whether an LLM is giving answers relevant to the user’s intent and context.
- Sentiment Changes: Monitoring sentiment changes during a conversation can provide great insights into LLM performance. For instance, a user may show frustration through prompts whose language exhibits anger or annoyance, indicating the LLM isn't helping them; the sketch after this list shows one way to track such signals.
- Topic Categorization: Categorizing prompts into relevant topics can help LLMs give more targeted responses. Measuring such categorization can help improve response quality.
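To make a couple of these dimensions concrete, here is a minimal sketch of sentiment monitoring on user prompts, assuming Hugging Face's transformers library and its default sentiment pipeline; the 0.9 threshold and the `flag_negative_turns` helper are illustrative assumptions, not a prescribed setup. A toxicity or topic classifier can be plugged in the same way.

```python
# pip install transformers torch
from transformers import pipeline

# Default English sentiment model; swap in a toxicity or topic classifier
# (loaded the same way) to cover the other dimensions above.
sentiment = pipeline("sentiment-analysis")

def flag_negative_turns(user_turns: list[str], threshold: float = 0.9) -> list[str]:
    """Return user turns whose sentiment is confidently negative."""
    flagged = []
    for turn in user_turns:
        result = sentiment(turn)[0]  # e.g. {'label': 'NEGATIVE', 'score': 0.98}
        if result["label"] == "NEGATIVE" and result["score"] >= threshold:
            flagged.append(turn)
    return flagged

conversation = [
    "Hi, I need help tracking my order.",
    "That link doesn't work. This is the third time I'm asking and nothing helps!",
]
print(flag_negative_turns(conversation))
```

Flagged turns can be logged alongside the LLM's responses so reviewers see exactly where a conversation went wrong.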
Automated and Custom Metrics
Once the developers decide what metrics to track, they can refer to the LLM lifecycle and use automated or custom metrics to measure the above factors.
Automated metrics are helpful during the initial two stages, as the objective is to see whether LLMs perform well on generic tasks. Custom metrics can help assess the quality in later stages, which are more use-case-specific.
Mainstream automated LLM evaluation metrics include the following (a short code sketch for computing them appears after the list):
- BLEU: Bilingual Evaluation Understudy (BLEU) is an n-gram-based matching algorithm most commonly used for evaluating machine translation. It measures precision by checking how many n-grams in the generated output appear in a reference human-written sentence.
- ROUGE: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is similar to BLEU, with the key difference that ROUGE measures recall. It checks how many n-grams in the human-written reference appear in the machine-generated output.
- MAUVE: MAUVE measures the divergence between the distribution of machine-generated text and that of human-written samples. It compares broader samples of outputs rather than performing one-to-one matching like BLEU or ROUGE.
- BERTScore: Like BLEU and ROUGE, BERTScore measures recall and precision. However, it converts the machine output and the human reference into embeddings and computes their similarity.
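As a minimal sketch, assuming Hugging Face's `evaluate` package (plus the `rouge_score` and `bert_score` backends), the snippet below scores a generated answer against a single human reference; in practice these metrics are computed over a full evaluation set, and the example strings are invented for illustration.

```python
# pip install evaluate rouge_score bert_score torch
import evaluate

predictions = ["There is a table with a cat looking over it."]
references = [["A cat is sitting next to a table."]]  # one or more references per prediction

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```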
Custom metrics are use-case specific and can include the following:
- Users’ Reviews: Organizations can analyze user reviews and use sentiment analysis to measure how users feel about the service. A positive sentiment score indicates the application is doing well.
- Response Relevance: AI experts can develop an embedding-based similarity metric that measures how close LLM responses are to particular user prompts (see the sketch after this list).
- Number of Queries Resolved in the First Attempt: A quick way to evaluate chatbots is to see how many queries they resolve without requiring the customer to follow up. Built-in methods where chatbots ask users questions with a yes/no answer can be helpful. For instance, the chatbot can ask, “Did you find what you were looking for?” and the user can click either “Yes” or “No” to provide feedback.
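For the response-relevance idea above, a minimal sketch using the sentence-transformers library might look like the following; the `all-MiniLM-L6-v2` model and the example prompt/response pair are illustrative assumptions.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Small general-purpose embedding model, chosen only for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

def response_relevance(prompt: str, response: str) -> float:
    """Cosine similarity between prompt and response embeddings (higher means more relevant)."""
    embeddings = model.encode([prompt, response], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

print(response_relevance(
    "What is your return policy for damaged items?",
    "You can return damaged items within 30 days for a full refund.",
))
```

Tracking this score over time, or alerting when it dips for many prompts, turns a one-off check into a monitoring signal.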
Using Service-Level Objectives (SLOs)
An SLO is a standard that an application must meet to maintain a high level of performance. For example, an SLO for a customer care chatbot might be that it resolves at least 80% of user queries each month.
Organizations can formalize the evaluation process using such SLOs to ensure consistency across metrics and encourage developers to constantly improve the application by fulfilling established SLOs.
Below are a few examples of how organizations can apply SLOs in several use cases; a minimal sketch of tracking one such SLO follows the examples.
- SLOs for Recommendation Systems: An SLO stating that a recommendation system must have an overall 90% accuracy will ensure users get suitable recommendations 90% of the time. Experts can track the SLO using metrics such as click-through rates and user reviews.
- SLOs for Code Generators: A relevant SLO for code generators can be to develop code that complies with a company’s coding standards 90% of the time. Practitioners can measure this by checking the proportion of code that passes style and documentation checks.
- SLOs for Text Summarization: An SLO for text summarization can be to generate 90% of summaries within one minute. Experts can track the time an LLM takes to generate summaries and measure how many it completes within the time limit.
- SLOs for Information Retrieval Systems: An SLO for such systems can be to return relevant results 90% of the time. Developers can measure this using precision and recall by classifying each result as relevant or irrelevant.
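As a rough sketch of tracking the chatbot SLO described earlier, the snippet below computes a first-contact resolution rate from logged "Yes"/"No" feedback and compares it with an 80% target; the log structure, dates, and values are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class QueryLog:
    day: date
    resolved_first_attempt: bool  # from the "Did you find what you were looking for?" prompt

def slo_met(logs: list[QueryLog], target: float = 0.80) -> bool:
    """Check whether the first-contact resolution rate meets the SLO target."""
    if not logs:
        return False
    rate = sum(log.resolved_first_attempt for log in logs) / len(logs)
    print(f"Resolution rate: {rate:.1%} (target {target:.0%})")
    return rate >= target

monthly_logs = [
    QueryLog(date(2024, 5, 2), True),
    QueryLog(date(2024, 5, 3), True),
    QueryLog(date(2024, 5, 7), False),
    QueryLog(date(2024, 5, 9), True),
]
print("SLO met:", slo_met(monthly_logs))
```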
Best Practices
Selecting the right evaluation strategy is challenging and involves considering several factors to ensure LLMs work according to users’ expectations in specific domains.
However, organizations can follow certain best practices to establish a holistic evaluation plan for a robust LLM application.
Such best practices can be as follows:
- Determine the Goal of the LLM Application: Clearly defining goals is the most critical aspect of developing any strategy. What do you want to achieve with the application? Answering this question will help you create realistic metrics for tracking performance.
- Decide the Right Metrics for Monitoring LLMs: Once you know your goals and objectives, the next step is establishing measurable metrics that capture the application’s quality along the dimensions mentioned earlier.
- Establish Alert Mechanisms for Prompt Notifications: Systems should be in place to notify relevant teams when metrics approach critical values or breach thresholds, allowing for quick recovery before things get out of hand (a minimal alerting sketch appears after this list).
- Establish Safeguards: Organizations must develop monitoring methods to track harmful prompts and block offensive or misleading responses.
- Build Scalable Applications: Automating workflows and using cloud-based platforms is advisable to keep your system flexible.
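To make the alerting practice concrete, here is a minimal, framework-agnostic sketch that checks monitored metrics against thresholds and hands breaches to a notification hook; the metric names, threshold values, and `notify` function are placeholders rather than a prescribed integration.

```python
import logging

logging.basicConfig(level=logging.INFO)

# Illustrative thresholds; real values should come from the application's SLOs.
THRESHOLDS = {
    "p95_latency_seconds": 2.0,        # alert if above
    "toxicity_rate": 0.01,             # alert if above
    "first_contact_resolution": 0.80,  # alert if below
}
LOWER_IS_BREACH = {"first_contact_resolution"}

def notify(message: str) -> None:
    """Placeholder notification hook (in practice: email, Slack, PagerDuty, etc.)."""
    logging.warning(message)

def check_metrics(metrics: dict[str, float]) -> None:
    """Compare current metric values to thresholds and alert on breaches."""
    for name, value in metrics.items():
        threshold = THRESHOLDS.get(name)
        if threshold is None:
            continue
        breached = value < threshold if name in LOWER_IS_BREACH else value > threshold
        if breached:
            notify(f"{name}={value:.3f} breached threshold {threshold}")

check_metrics({
    "p95_latency_seconds": 2.7,
    "toxicity_rate": 0.004,
    "first_contact_resolution": 0.72,
})
```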
Build LLM-Based Applications With Continuous Monitoring & Evaluation
As LLMs evolve to handle increasingly complex problems, evaluation becomes more challenging because no single metric provides a complete picture of an LLM application's quality.
However, Deepchecks’ LLM monitoring platform is a state-of-the-art tool that offers novel features to assess the quality of LLM applications from pre-deployment to production. It monitors real-time factors such as correctness, bias, robustness, etc., to ensure optimal performance and regulatory compliance.
So, try the evaluation platform now to boost your application’s performance.