Deepchecks LLM Evaluation for RAG Apps

Everything you need for testing and evaluating your LLM-based applications. Build
production-ready Retrieval-Augmented Generation (RAG) applications.

Test Different Options Objectively

Evaluate your RAG app holistically and test different components to find the best combination.

Test different:

  • LLM models
  • Prompts
  • Chunking strategies
  • Embedding models
  • Retrieval methods

Make product decisions and vendor selection metric-driven, as sketched below.
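
For illustration, here is a minimal sketch of such a metric-driven comparison. This is generic Python rather than the Deepchecks SDK; `build_rag_pipeline` and `score_groundedness` are hypothetical placeholders for your own pipeline-building and evaluation code.

```python
# A minimal sketch of metric-driven comparison of RAG configurations.
# NOTE: generic Python, not the Deepchecks SDK. `build_rag_pipeline` and
# `score_groundedness` are hypothetical placeholders.
from itertools import product
from statistics import mean

LLM_MODELS = ["model-a", "model-b"]
CHUNK_SIZES = [256, 512]
EMBEDDING_MODELS = ["embedding-x", "embedding-y"]
EVAL_QUESTIONS = ["What is the refund policy?", "How do I reset my password?"]


def build_rag_pipeline(llm, chunk_size, embedding):
    """Hypothetical factory: returns a callable question -> (answer, context)."""
    def answer(question):
        return f"answer from {llm}", f"context via {embedding} / chunks of {chunk_size}"
    return answer


def score_groundedness(answer, context):
    """Hypothetical scorer in [0, 1]; replace with a real evaluation metric."""
    return 1.0 if answer and context else 0.0


results = []
for llm, chunk_size, embedding in product(LLM_MODELS, CHUNK_SIZES, EMBEDDING_MODELS):
    pipeline = build_rag_pipeline(llm, chunk_size, embedding)
    scores = [score_groundedness(*pipeline(q)) for q in EVAL_QUESTIONS]
    results.append(((llm, chunk_size, embedding), mean(scores)))

# Rank configurations so the product/vendor decision is metric-driven.
for config, score in sorted(results, key=lambda r: r[1], reverse=True):
    print(config, round(score, 3))
```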

Understand How Each Step Can Be Improved

Grounded in Context

Measures how well the LLM output is grounded in the context retrieved for the question.

Retrieval Relevance

Measures how relevant the retrieved context is to the question being asked.

Correctness

Measures whether the LLM output correctly and completely answers the question.
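
To make these three measures concrete, here is a rough, self-contained sketch that uses simple token overlap as a stand-in for the model-based properties Deepchecks computes. All function names are illustrative, not part of the Deepchecks API.

```python
# Token-overlap stand-ins for the three step-level measures. Illustrative only.
def _tokens(text):
    return set(text.lower().split())


def grounded_in_context(answer, context):
    """Share of answer tokens that are supported by the retrieved context."""
    answer_tokens = _tokens(answer)
    return len(answer_tokens & _tokens(context)) / max(len(answer_tokens), 1)


def retrieval_relevance(question, context):
    """Share of question tokens covered by the retrieved context."""
    question_tokens = _tokens(question)
    return len(question_tokens & _tokens(context)) / max(len(question_tokens), 1)


def correctness(answer, reference):
    """Overlap between the answer and a gold reference answer."""
    reference_tokens = _tokens(reference)
    return len(_tokens(answer) & reference_tokens) / max(len(reference_tokens), 1)


question = "When was the warranty extended?"
context = "The warranty was extended to 24 months in January 2023."
answer = "The warranty was extended in January 2023."
reference = "The warranty was extended in January 2023."

print(grounded_in_context(answer, context))    # high: the answer is supported
print(retrieval_relevance(question, context))  # high: the context matches the question
print(correctness(answer, reference))          # high: matches the reference answer
```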

Automatic Annotation & Scoring

Deepchecks automatically annotates LLM interactions using a combination of open-source, proprietary, and LLM models.

Easily configure the out-of-the-box scoring to further improve accuracy; see the configuration sketch after this list.

  • Choose the properties to prioritize
  • Refine properties and similarity thresholds
  • Change the location of each step in the pipeline
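
As an illustration only, a configuration along these lines might look like the following. The key names and schema here are invented for this sketch and are not the Deepchecks configuration format; they simply show the kinds of knobs described above: which properties to prioritize, their thresholds, and the pipeline step each one attaches to.

```python
# Hypothetical scoring configuration (invented schema, not the Deepchecks format).
scoring_config = {
    "properties": [
        {"name": "grounded_in_context", "weight": 0.5,
         "threshold": 0.8, "pipeline_step": "generation"},
        {"name": "retrieval_relevance", "weight": 0.3,
         "threshold": 0.7, "pipeline_step": "retrieval"},
        {"name": "correctness", "weight": 0.2,
         "threshold": 0.9, "pipeline_step": "generation"},
    ],
    # Interactions scoring below this aggregate get annotated as "bad".
    "annotation_cutoff": 0.75,
}


def aggregate_score(property_scores):
    """Weighted average over the configured properties."""
    total_weight = sum(p["weight"] for p in scoring_config["properties"])
    return sum(
        p["weight"] * property_scores.get(p["name"], 0.0)
        for p in scoring_config["properties"]
    ) / total_weight


print(aggregate_score({"grounded_in_context": 0.9,
                       "retrieval_relevance": 0.8,
                       "correctness": 1.0}))  # 0.89
```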

Mine Hard Samples for Fine-Tuning & Debugging

Easily extract edge cases or sets of samples where your RAG application doesn't perform well. Use them to adjust the code or prompt, or download them for the next iteration of re-training.
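
A small sketch of what that extraction step can look like in plain Python; the interaction records and threshold below are assumptions for illustration, not the Deepchecks export format.

```python
# Mine low-scoring interactions as hard samples (hypothetical data shape).
import json

interactions = [
    {"question": "q1", "answer": "a1", "score": 0.92},
    {"question": "q2", "answer": "a2", "score": 0.41},
    {"question": "q3", "answer": "a3", "score": 0.55},
]

HARD_SAMPLE_THRESHOLD = 0.6
hard_samples = [i for i in interactions if i["score"] < HARD_SAMPLE_THRESHOLD]

# Write the edge cases to JSONL for debugging, prompt fixes, or the next
# fine-tuning / re-training iteration.
with open("hard_samples.jsonl", "w") as f:
    for sample in hard_samples:
        f.write(json.dumps(sample) + "\n")

print(f"Exported {len(hard_samples)} hard samples")
```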

LLMOps.Space

Deepchecks is a founding member of LLMOps.Space, a global community for LLM
practitioners. The community focuses on LLMOps-related content, discussions, and
events. Join thousands of practitioners on our Discord.
Join Discord Server

LLMOps Past Events

Config-Driven Development for LLMs: Versioning, Routing, & Evaluating LLMs
Fine-tuning LLMs with Hugging Face SFT 🤗
The Science of LLM Benchmarks: Methods, Metrics, and Meanings

Featured Content

LLM Evaluation: When Should I Start?
How to Build, Evaluate, and Manage Prompts for LLM