DEEPCHECKS GLOSSARY

DeepEval

The recent surge of Large Language Model (LLM) applications such as GPT-4o, Midjourney, and Sora, which generate text and create realistic images and videos, is transforming human-machine interaction and entire industries. With this growing adoption, evaluating LLMs for effectiveness, reliability, and ethical standards is crucial to ensuring their safe, secure, and responsible use. DeepEval simplifies this evaluation process by providing standardized methods, metrics, and benchmarks to measure and improve LLM performance, reliability, and fairness.

What is DeepEval?

DeepEval is an open-source framework designed to comprehensively evaluate LLMs. It goes beyond traditional metrics to incorporate a wide array of evaluation techniques, ensuring a holistic assessment of LLM performance. DeepEval employs a modular architecture to “unit test” LLM outputs, much as Pytest unit tests code, allowing for flexible and customizable evaluation protocols tailored to specific needs and contexts.

An ideal evaluation workflow using DeepEval is shown below:

[Figure: DeepEval evaluation workflow]

Features of DeepEval

1. Modular design:

DeepEval’s modular design allows users to customize their evaluation pipelines. This flexibility ensures the framework can adapt to various LLM architectures and application domains.
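
As a rough illustration of this modularity, metrics can be mixed and matched within a single evaluation run. The sketch below uses the evaluate helper with two built-in metrics and assumes an OPENAI_API_KEY is set; the example output string is made up:

# Sketch: composing multiple metrics into one evaluation pipeline
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You can return them within 30 days for a full refund.",
)

# Swap metrics in or out to tailor the pipeline to your application
evaluate([test_case], [AnswerRelevancyMetric(threshold=0.7), BiasMetric(threshold=0.5)])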

2. Comprehensive metrics:

The framework offers a collection of ready-to-use metrics, including more than 14 LLM-evaluated metrics backed by research. These metrics cover a wide range of use cases, from basic performance indicators to measures of coherence, relevance, faithfulness, hallucination, toxicity, bias, summarization quality, and contextual understanding. DeepEval also lets you define custom metrics to suit your specific needs.
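
DeepEval's GEval metric, for example, lets you describe a custom LLM-evaluated criterion in plain language. The snippet below is a minimal sketch of such a metric; it assumes an OPENAI_API_KEY is set for the evaluating model, and the criterion text is illustrative:

# Sketch: a custom "Correctness" metric defined with GEval
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You can return them within 30 days for a full refund.",
    expected_output="We offer a 30-day full refund at no extra cost.",
)

correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)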

3. Benchmarks:

DeepEval offers state-of-the-art, research-backed benchmarks such as HellaSwag, MMLU, HumanEval, and GSM8K, providing standardized ways to measure LLM performance across various tasks.
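
As a rough sketch of what a benchmark run looks like, the snippet below evaluates a stand-in model on a single MMLU task. In practice you would wrap your own LLM in a DeepEvalBaseLLM subclass, and exact class and parameter names may differ between DeepEval versions:

# Sketch: running the MMLU benchmark on a stand-in model wrapper
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask
from deepeval.models import DeepEvalBaseLLM

class DummyModel(DeepEvalBaseLLM):
    """Placeholder wrapper; replace generate() with calls to your own LLM."""
    def load_model(self):
        return None

    def generate(self, prompt: str) -> str:
        # MMLU expects a single answer letter (A/B/C/D)
        return "A"

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "dummy-model"

# Evaluate on one task with 5-shot prompting
benchmark = MMLU(tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE], n_shots=5)
benchmark.evaluate(model=DummyModel())
print(benchmark.overall_score)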

4. Synthetic data generator:

Creating comprehensive evaluation datasets is challenging. DeepEval includes a data synthesizer that uses an LLM to generate and evolve inputs, creating complex and realistic datasets for diverse use cases.
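
A minimal sketch of the synthesizer is shown below. The file name knowledge_base.pdf is a hypothetical document path, and method and attribute names may differ slightly across DeepEval versions:

# Sketch: generating synthetic evaluation data ("goldens") from documents
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(document_paths=["knowledge_base.pdf"])

# Inspect the generated inputs
for golden in synthesizer.synthetic_goldens:
    print(golden.input)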

5. Real-time and continuous evaluation:

DeepEval integrates with Confident AI to support continuous evaluation throughout an LLM application’s lifecycle. This integration enables evaluation in production, centralized cloud-based datasets, tools for tracing and debugging, evaluation history tracking, and summary report generation for stakeholders.
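
As a sketch, connecting a project to Confident AI and sending test results there is done from the command line, assuming you already have a Confident AI API key:

deepeval login
deepeval test run test_example.py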


DeepEval in Action

Below are the main steps to run an evaluation using DeepEval:

1. Install DeepEval (preferably in a virtual environment):

python3 -m venv venv
source venv/bin/activate

pip install -U deepeval

2. Create a test file (the filename must start with “test_”), e.g., test_example.py:

Below is a sample test case that checks the answer relevancy of a gpt-3.5-turbo response:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from openai import OpenAI
import os

# Set the OPENAI_API_KEY environment variable (used by both the OpenAI client
# and DeepEval's LLM-evaluated metrics); replace 'API KEY' with your own key
os.environ['OPENAI_API_KEY'] = 'API KEY'
client = OpenAI()

# Obtain a chat completion response from gpt-3.5-turbo
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant in a shoe shop."},
    {"role": "user", "content": "What if these shoes don't fit?"}
  ]
)

# Get the actual output
actual_output = response.choices[0].message.content
print("Response from GPT3.5 Turbo: ", actual_output)

# Check the relevancy of the GPT-3.5 answer against a threshold of 0.8
def test_answer_relevancy():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.8)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Actual output of your LLM application
        actual_output=actual_output,
        # Expected (reference) output
        expected_output="We offer a 30-day full refund at no extra cost.",
    )
    assert_test(test_case, [answer_relevancy_metric])

3. Run the test:

deepeval test run test_example.py

4. Check results:

A successful test run produces output containing the metric’s score and a detailed explanation of how it was computed:

[Screenshot: deepeval test run results showing the metric score and explanation]

Refer to the DeepEval documentation for more on customizing and using test cases and benchmarks.