Deepchecks’ New Major Release:
Evaluation for LLM-Based Apps

Introduction

At Deepchecks, we’ve built a pretty special solution for LLM Evaluation that we’ll be unveiling in just under two weeks. In this post, I’ll share a bit about our journey leading up to this, along with some thoughts about LLMs and where the space is going.

Some of you may have noticed that at Deepchecks we’ve been putting the LLM use case in an increasingly central place. We’ve put LLMs front and center on our website, founded the LLMOps.Space community, which hosts an event every 1–2 weeks, and have been talking & writing about this more than any other use case.

What you probably haven’t noticed is that for the last couple of months, we’ve been working closely with beta partners on our new module for LLM Evaluation. And now, finally, on November 28th, we’ll be publicly releasing it at this LLMOps.Space event. Hope to see you there (sign up 👉here👈)!

Product Launch Event: Sign up here

I’m extremely excited about this launch, and I’m writing this blog post to share my personal perspective on the journey leading up to it.

Deepchecks Background

Since the launch of our open-source package for testing ML models in January 2022, we’ve seen an overwhelmingly positive response, with over 3,000 GitHub stars and more than 900,000 downloads. The growing usage, combined with pretty clear feature requests from the community, encouraged us to expand our offering from testing tabular data to several other areas:

  1. Monitoring of models in production, including an open-source version
  2. Testing module for Computer Vision
  3. Testing module for NLP

Illustration of Deepchecks’ NLP Testing module, which was open-sourced in June

The module for NLP was released last June (see here), but even before its release, we noticed that an incredible number of the feedback calls we were having about the package were really requests for help with evaluating LLM-based apps. After creating an initial POC (based on Streamlit) and getting feedback from various companies, we gained the confidence we needed to dive deeply into the LLM Evaluation space. And yes, it turns out it’s a pretty big deal.

What Makes This Interesting?

LLMs Are Changing the Tech World at a Faster Pace Than Anticipated

As you can understand from our history, we’ve made it our mission to make a dent in the way AI systems are validated from pretty much day 1 of Deepchecks. When we set out to achieve this goal almost 4 years ago, some folks thought we were crazy. We got asked questions like:

  • Isn’t AI just a buzzword?
  • Do teams actually build this or just talk about it?

Little did they (or we) know about the AI/LLM revolution, which would so quickly turn these systems from an early adopter niche into a major part of so many of our lives. ChatGPT was launched just under a year ago, and it’s already almost unimaginable to think about the tech world without it.

According to Gartner, by 2024, 40% of enterprise applications will have embedded conversational AI, up from less than 5% in 2020. And from a pretty large sample of chats with the folks building these LLM-based apps, it sounds like evaluating them is one of the major obstacles. Talk about having a good answer to the “why now” question!

So What’s Different About Testing & Evaluating LLMs?

As you can guess, my team and I accumulated some knowledge about testing AI & ML while building the modules listed above. However, as we began working on the LLM Evaluation module, we arrived at some important learnings:

  1. Teams are working on LLM Evaluation to answer 2 similar but pretty different questions:
    Is it good? (accuracy, relevance, usefulness, grounded in context, etc.)
    Is it not bad? (bias, toxicity, PII leakage, straying from company policy, etc.)
    While both of these aspects existed in one way or another even before the LLM revolution, they are now much more intertwined, and it’s much more common for a company or team to need both of them.
  2. The concept of a “test set”, which is a fundamental term in non-generative ML, doesn’t really exist in the same manner. This is because there can be multiple correct responses for a single input. That leads to a few possible approaches, one of the most common being to work with a fixed set of inputs curated by the team (i.e., a golden set); see the sketch after this list.
  3. The user is a bit different. While data scientists, machine learning engineers, and software developers are all still involved in model quality, we’ve reached the conclusion that in most cases they won’t be the primary quality owners of LLM-based apps over time. For the LLM use case, new users like data curators, product managers, and business analysts should be able to independently modify policies, edit prompts, compare versions, and more.
  4. Phases are a bit different than with “classic ML”. The phases in most textbooks include training, testing, validation & production. For LLM-based apps, training is usually not done by the same team that’s building the app, so we can look at the phases a bit differently: experimentation/development, staging/beta testing, and production.
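To make the first two learnings a bit more concrete, here’s a minimal sketch in Python (emphatically not Deepchecks’ actual API) of evaluating an LLM-based app over a curated golden set, scoring each response on both “is it good” and “is it not bad” dimensions. Every function and field name here is a hypothetical placeholder; a real setup would plug in proper metrics or LLM-based judges.

```python
# A minimal sketch (not Deepchecks' API) of golden-set evaluation:
# run the app on a fixed, team-curated set of inputs and score each
# response on "is it good" and "is it not bad" dimensions.

from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenSample:
    prompt: str
    reference_context: str  # e.g., retrieved docs the answer should be grounded in

def relevance_score(prompt: str, response: str) -> float:
    """Placeholder: how relevant the response is to the prompt (0-1)."""
    return 1.0 if response else 0.0

def groundedness_score(response: str, context: str) -> float:
    """Placeholder: how well the response is supported by the context (0-1)."""
    return 1.0 if context and response else 0.0

def toxicity_score(response: str) -> float:
    """Placeholder: estimated probability the response is toxic (lower is better)."""
    return 0.0

def contains_pii(response: str) -> bool:
    """Placeholder: naive check for PII leakage."""
    return "@" in response  # e.g., flag anything that looks like an email address

def evaluate_app(app: Callable[[str], str], golden_set: list[GoldenSample]) -> list[dict]:
    """Run the app on every golden-set input and collect both kinds of scores."""
    results = []
    for sample in golden_set:
        response = app(sample.prompt)
        results.append({
            "prompt": sample.prompt,
            # "Is it good?" dimensions
            "relevance": relevance_score(sample.prompt, response),
            "groundedness": groundedness_score(response, sample.reference_context),
            # "Is it not bad?" dimensions
            "toxicity": toxicity_score(response),
            "pii_leak": contains_pii(response),
        })
    return results

# Usage: score one app version on the golden set
golden_set = [GoldenSample(prompt="How do I reset my password?",
                           reference_context="Password resets are done via the account settings page.")]
results = evaluate_app(lambda p: "Go to account settings and click 'Reset password'.", golden_set)
print(results)
```

Because the golden set is fixed, the same loop can be re-run whenever a prompt, policy, or model version changes, which is what makes version-to-version comparison practical even without a classic labeled test set.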

So how would you go about building an offering that takes these learnings into account? At Deepchecks it was quite an effort, and we felt these differences were significant enough to warrant a whole separate module & UX for LLMs rather than trying to expand our existing solutions.

Want To See What We’ve Built?

Sorry, can’t expose that here — at least not yet.
But you don’t have to wait long for it. The launch date is less than 2 weeks away, on November 28th.

Sign up here!

And hope to see you there 🎉😊

Product Launch Event: Sign up here. Yes, I know it also appeared above. You have an amazing attention span and I knew you would notice this; I just did it for some of your colleagues who may have been paying less attention.

Philip Tannor is the co-founder and CEO of Deepchecks, an open-source-led company for continuously validating AI systems. Philip has a rich background in AI/ML and has experience with projects including NLP, image processing, time series, signal processing, and more. Philip holds an M.Sc. in Electrical Engineering and a B.Sc. in Physics and Mathematics, although he barely remembers anything from his studies that doesn’t relate to computer science or algorithms.

(OK, my father, who is a Quantum Mechanics professor, requested that I clarify that that’s just a joke. I do like AI more than physics 🤗, but you can still ask me about Bernoulli’s principle or Maxwell’s equations. Just don’t surprise me on a podcast!)
