Deepchecks LLM Evaluation

Test, Evaluate & Monitor LLM apps

An end-to-end evaluation and observability solution that supports every part of your workflow.
  • Hallucination Mitigation
  • Full Lifecycle Support
  • Automated Evaluation

Evaluation Components

Evaluating LLM apps is complex: it requires a holistic set of capabilities to get the job done.

Automated Scoring

Get well-defined metrics for each aspect of your LLM-based apps

Version Comparison

Evaluate your LLM-based app & test different components to find the best combination

Properties

Check each aspect of your LLM-based app using custom and off-the-shelf properties

Golden Set Management

Build and expand the set of interactions for version and experiment comparison

Monitoring

Apply rigorous checks to ensure your LLMs consistently deliver optimal performance

Debugging

Get to the root cause with filtering and drill-down into every application step
Automated Scoring

Automated Annotation with Manual Override

Get both manual and automated annotations to evaluate all the interactions with the LLM.

Create a ground truth with manual annotations and fine-tune your automatic annotation pipeline to provide more accurate results.
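
To make the idea concrete, here is a minimal, illustrative sketch in plain Python (not the Deepchecks SDK): automated labels come from a toy heuristic, and a hypothetical manual-override mapping takes precedence and doubles as ground truth.

```python
# Illustrative sketch only: automated annotation with a manual-override layer.
from dataclasses import dataclass


@dataclass
class Interaction:
    id: str
    question: str
    answer: str


def auto_annotate(interaction: Interaction) -> str:
    """Toy automatic annotation: flag empty or evasive answers as 'bad'."""
    answer = interaction.answer.strip().lower()
    if not answer or "i don't know" in answer:
        return "bad"
    return "good"


# Manual annotations take precedence over the automatic pipeline
# and serve as ground truth for tuning it.
manual_overrides = {"conv-2": "bad"}  # hypothetical reviewer decision

interactions = [
    Interaction("conv-1", "What is our refund policy?", "Refunds within 30 days."),
    Interaction("conv-2", "Is the API rate limited?", "Yes, absolutely unlimited."),
]

for it in interactions:
    label = manual_overrides.get(it.id, auto_annotate(it))
    print(it.id, label)
```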

Version Comparison

Compare Experiments and Pre-Production Versions

Experiment with different:

  • LLMs
  • Vector databases
  • Knowledge sources
  • Embedding models
  • Retrieval methods

Make product decisions and vendor selection metric-driven.
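
As an illustration of metric-driven comparison, the small sketch below runs the same evaluation questions through two configurations and picks the winner by an aggregate score. The configs, the run_app placeholder, and the score_answer heuristic are hypothetical stand-ins, not Deepchecks APIs.

```python
# Illustrative sketch only: compare two pipeline configurations on the same
# set of interactions and pick the winner by an aggregate score.
from statistics import mean

versions = {
    "v1": {"llm": "model-a", "retrieval": "bm25"},
    "v2": {"llm": "model-b", "retrieval": "dense"},
}

eval_questions = ["What is our refund policy?", "Is the API rate limited?"]


def run_app(config: dict, question: str) -> str:
    """Placeholder for calling the actual LLM app with a given configuration."""
    return f"[{config['llm']}/{config['retrieval']}] answer to: {question}"


def score_answer(question: str, answer: str) -> float:
    """Toy relevance score; in practice this would be a real evaluation metric."""
    return 1.0 if question.split()[-1].rstrip("?") in answer else 0.0


results = {
    name: mean(score_answer(q, run_app(cfg, q)) for q in eval_questions)
    for name, cfg in versions.items()
}
best = max(results, key=results.get)
print(results, "-> best version:", best)
```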

Properties

Understand How Each Step Can Be Improved

Rigorously check each aspect of your LLM-powered application using Deepchecks’ custom and off-the-shelf properties.

  • Quality Metrics
  • Safety Metrics
  • LLM Metrics
  • Quantitative Metrics
  • User-Defined Metrics
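
For intuition, the sketch below computes a few toy per-interaction properties in plain Python; the property names and the blocklist are illustrative assumptions, not Deepchecks’ built-in properties.

```python
# Illustrative sketch only: compute simple per-interaction properties
# (answer length, a safety-keyword flag, and a user-defined check).
from typing import Callable

BLOCKLIST = {"idiot", "stupid"}  # toy stand-in for a safety lexicon


def answer_length(question: str, answer: str) -> float:
    return float(len(answer.split()))


def contains_blocked_term(question: str, answer: str) -> float:
    return float(any(word in answer.lower() for word in BLOCKLIST))


def echoes_question(question: str, answer: str) -> float:
    """User-defined property: does the answer merely repeat the question?"""
    return float(question.strip().lower() in answer.strip().lower())


PROPERTIES: dict[str, Callable[[str, str], float]] = {
    "answer_length": answer_length,
    "blocked_term": contains_blocked_term,
    "echoes_question": echoes_question,
}

question, answer = "Is the API rate limited?", "Yes, 100 requests per minute."
report = {name: fn(question, answer) for name, fn in PROPERTIES.items()}
print(report)
```
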
Golden Set Management

Build, Expand & Explore Your Golden Set

A “Golden Set” is like the “test set” from classic ML, adapted to benchmark generative applications.

  • Explore your annotated responses to learn what is & isn’t working
  • Generate or expand your Golden Set with Deepchecks LLM Evaluation
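
Here is a simplified sketch of the workflow, assuming a hypothetical question/expected-answer format and a toy exact-match metric; real golden sets typically carry richer annotations and metrics.

```python
# Illustrative sketch only: keep a golden set of annotated pairs,
# benchmark each new version against it, and expand it over time.
import json

golden_set = [
    {"question": "What is our refund window?", "expected": "30 days"},
    {"question": "Which regions are supported?", "expected": "EU and US"},
]


def current_app(question: str) -> str:
    """Placeholder for the LLM app under test."""
    return "30 days" if "refund" in question else "EU only"


def exact_match(expected: str, actual: str) -> bool:
    return expected.strip().lower() == actual.strip().lower()


hits = sum(exact_match(row["expected"], current_app(row["question"])) for row in golden_set)
print(f"golden-set accuracy: {hits}/{len(golden_set)}")

# Expanding the golden set: append newly annotated production interactions.
golden_set.append({"question": "Is SSO available?", "expected": "Yes, via SAML"})
print(json.dumps(golden_set, indent=2))
```
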
Monitoring

LLM Apps Production Monitoring

LLM applications require much more than just input and output format validation.

Hallucinations, harmful content, model performance degradation, and broken data pipelines are common problems that can arise over time.
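
To illustrate the kind of check involved, here is a bare-bones sketch with hypothetical logged fields and alert thresholds; a real monitoring setup would track many more signals.

```python
# Illustrative sketch only: a periodic production check that tracks an
# estimated hallucination rate and latency, not just response format.
from statistics import mean

recent_interactions = [
    {"hallucination_flag": False, "latency_ms": 850},
    {"hallucination_flag": True, "latency_ms": 1200},
    {"hallucination_flag": False, "latency_ms": 900},
]

HALLUCINATION_RATE_LIMIT = 0.10  # alert if more than 10% of answers are flagged
LATENCY_LIMIT_MS = 1500          # alert if average latency drifts too high

hallucination_rate = mean(i["hallucination_flag"] for i in recent_interactions)
avg_latency = mean(i["latency_ms"] for i in recent_interactions)

alerts = []
if hallucination_rate > HALLUCINATION_RATE_LIMIT:
    alerts.append(f"hallucination rate {hallucination_rate:.0%} above limit")
if avg_latency > LATENCY_LIMIT_MS:
    alerts.append(f"average latency {avg_latency:.0f} ms above limit")

print(alerts or "all monitoring checks passed")
```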

Debugging

Debugging and Root-Cause Analysis

Methodically pinpoint where the problems lie within your LLM application.

  • Automatically identify your weakest segments
  • Manually segment your data to identify more weak segments
  • See all the detailed steps of your LLM app to find the one that failed
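
As a rough sketch of segment-based root-cause analysis, the plain-Python example below groups annotated interactions by a hypothetical metadata field and surfaces the weakest segment.

```python
# Illustrative sketch only: group annotated interactions by a metadata field
# (here, the retrieval source) and find the segment with the lowest quality.
from collections import defaultdict
from statistics import mean

interactions = [
    {"source": "docs", "good": 1},
    {"source": "docs", "good": 1},
    {"source": "forum", "good": 0},
    {"source": "forum", "good": 1},
    {"source": "tickets", "good": 0},
]

by_segment = defaultdict(list)
for it in interactions:
    by_segment[it["source"]].append(it["good"])

scores = {segment: mean(labels) for segment, labels in by_segment.items()}
weakest = min(scores, key=scores.get)
print(scores, "-> weakest segment:", weakest)
```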

LLMOps.Space

Deepchecks is a founding member of LLMOps.Space, a global community for LLM practitioners. The community focuses on LLMOps-related content, discussions, and events. Join thousands of practitioners on our Discord.
Join Discord Server

Past Events

LLM Application Observability | Deepchecks Evaluation
Config-Driven Development for LLMs: Versioning, Routing, & Evaluating LLMs
Fine-tuning LLMs with Hugging Face SFT 🤗

Featured Content

LLM Evaluation: When Should I Start?