LLM Evaluation With Deepchecks & Vertex AI

Introduction

This blog post walks you through an integration between Google Vertex AI and the Deepchecks LLM evaluation platform, using a custom prediction routine server deployed on Vertex AI. This integration empowers researchers and developers to continuously evaluate, monitor, and refine their LLMs directly within their Vertex AI environment.

Vertex AI + Deepchecks integration

Why Is This Interesting?

  • You may have seen that Gemini is beginning to surpass GPT-3.5 and GPT-4 on some key metrics (see, for example, here), and that it has some distinctive characteristics such as a very long context window.
  • If you’ve been closely following the latest updates in the space, you may have concluded that now is a good time to start experimenting with other approaches.
  • Deepchecks specializes in evaluating LLM-based apps, enabling robust automatic scoring, version comparison, and auto-calculated metrics for properties like relevance and grounded-in-context. Its standout feature is the ability to accurately estimate the expected annotation for LLM responses, surpassing common methods like GPT-as-a-judge. The solution is designed to enable both LLM practitioners and non-technical users to iterate quickly while taking full control of both quality and compliance.
  • Trying out the combination of both solutions together can be a great learning experience, and includes learning about Gemini, LLM Evaluation, and potentially a fine-grained comparison of Gemini with other models.

So Let’s Get Started!

This guide dissects the Deepchecks-VertexAI integration, walking you through how to:

  • Craft a custom prediction server on VertexAI, with streamlined integration to Deepchecks.
  • Direct predictions from the VertexAI endpoint to Deepchecks for immediate feedback.
  • Run evaluations and capture insightful results on an LLM Summarization use case.
  • Analyze LLM performance with Deepchecks’ properties, topics, and auto-annotations.

Deepchecks LLM Evaluation Platform

Prerequisites

  • Active Google Cloud Account: Ensure you have an active Google Cloud account, or start a free trial.
  • Google Cloud Project for VertexAI: Use an existing project or create a new one to utilize VertexAI services.
  • Python Environment: Prepare your preferred Python environment (e.g. virtualenv) for coding.
  • Deepchecks Account: Ensure you have an active Deepchecks account, with a working API key in place. Request a free trial here if needed.

Setting up the Integration

We will follow the guide in the Vertex AI documentation to create a custom prediction routine. A custom prediction routine allows us to control the traffic sent to and from the LLM.

Let’s start by creating a requirements.txt file:

%%writefile requirements.txt
fastapi
uvicorn==0.17.6
joblib~=1.0
numpy~=1.20
scikit-learn~=0.24
pandas
python-dotenv
deepchecks-llm-client
google-cloud-storage>=1.26.0,<2.0.0dev
google-cloud-aiplatform[prediction]>=1.16.0

Install the dependencies in the notebook with pip:

!pip install -U --user -r requirements.txt

Next, create the required directories:

USER_SRC_DIR = "src_dir"
!mkdir $USER_SRC_DIR
# copy the requirements to the source dir
!cp requirements.txt $USER_SRC_DIR/requirements.txt

Create a .env file with the required tokens; it will be loaded automatically by the python-dotenv package:

%%writefile $USER_SRC_DIR/.env

HOST=https://app.llm.deepchecks.com
API_TOKEN=<DEEPCHECKS_API_TOKEN>
APP_NAME=<DEEPCHECK_APP>
VERSION_NAME=<DEEPCHECKS_VERSION_NAME>
GCP_PROJECT_ID=<GCP_PROJECT_ID>
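
Before moving on, it can be worth sanity-checking that the required variables are actually set once the `.env` file is loaded. The helper below is a minimal sketch that only uses `os.getenv`, so it works whether the values came from python-dotenv or were exported in the shell:

```python
import os

def missing_env_vars(required):
    """Return the names of required variables that are unset, empty,
    or still contain an unfilled <PLACEHOLDER> value."""
    return [
        name for name in required
        if not os.getenv(name) or os.getenv(name, "").startswith("<")
    ]

REQUIRED = ["HOST", "API_TOKEN", "APP_NAME", "VERSION_NAME", "GCP_PROJECT_ID"]

# After load_dotenv() has run (or the variables are exported), this
# list should be empty:
print("Missing:", missing_env_vars(REQUIRED))
```

If anything shows up in the list, fix the `.env` file before building the container; the Predictor will fail at runtime otherwise.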

Now that everything is in place, it’s time to write the Predictor class, which handles the logic of the custom prediction routine. We will implement it to call the Gemini Pro model and send every input and output to the Deepchecks LLM Evaluation app.

%%writefile $USER_SRC_DIR/predictor.py

import os
import logging
from abc import ABC
from typing import Any

import dotenv
from deepchecks_llm_client.client import dc_client
from deepchecks_llm_client.data_types import EnvType

logger = logging.getLogger('gemini_router')
logger.setLevel(logging.INFO)

dotenv.load_dotenv(override=True)

class Predictor(ABC):
  """Interface of the Predictor class for Custom 
  Prediction Routines. The Predictor is responsible 
  for the ML logic for processing a prediction request.
  Specifically, the Predictor must define:
  (1) How to load all model artifacts used during prediction 
      into memory.
  (2) The logic that should be executed at predict time.
  When using the default PredictionHandler, the Predictor will 
  be invoked as follows:
  predictor.postprocess(predictor.predict(predictor.preprocess(\
    prediction_input)))
  """

  def load(self, artifacts_uri: str) -> None:
    """Loads the model artifact.
    Args:
        artifacts_uri (str):
            Required. The value of the environment variable 
            AIP_STORAGE_URI.
    """
    pass
  def preprocess(self, prediction_input: Any) -> Any:
    """Preprocesses the prediction input before doing the 
       prediction.
    Args:
        prediction_input (Any):
            Required. The prediction input that needs to be 
            preprocessed.
    Returns:
        The preprocessed prediction input.
    """
    return prediction_input["instances"]

  def predict(self, instances: Any) -> Any:
    """Performs prediction.
    Args:
        instances (Any):
            Required. The instance(s) used for performing 
            prediction.
    Returns:
        Prediction results.
    """
    try:
      import vertexai
      from vertexai.preview.generative_models import GenerativeModel
      dc_client.init(
        host=os.getenv('HOST'),
        api_token=os.getenv('API_TOKEN'),
        app_name=os.getenv('APP_NAME'),
        version_name=os.getenv("VERSION_NAME"),
        env_type=EnvType.EVAL,
        auto_collect=False,
        log_level=logging.DEBUG)
    except ImportError as e:
      logger.error(f"predict() failed to import vertexai: {e}")
      return []
    project_id = os.getenv("GCP_PROJECT_ID")
    completions = []
    if project_id:
      location = "us-central1"
      vertexai.init(project=project_id, location=location)
      gen_model = GenerativeModel('gemini-pro')
      for prompt in instances:
        try:
          response = gen_model.generate_content(
            [
              f"""Summarize the main points of the passage.
                  \n passage: {prompt}"""
            ],
            generation_config={"max_output_tokens": 8192}
          )
          try:
            completion = str(response.text)
          except Exception:
            completion = "I'm sorry, I can't assist you with this request."
        except Exception:
          completion = ""
        dc_client.version_name(os.getenv("VERSION_NAME"))
        dc_client.log_interaction(
          input=prompt,
          output=completion
        )
        completions.append(completion)
    return completions
  def postprocess(self, prediction_results: Any) -> Any:
    """Postprocesses the prediction results.
    Args:
        prediction_results (Any):
            Required. The prediction results.
    Returns:
        The postprocessed prediction results.
    """
    return prediction_results

Let’s take a deeper look at the predict method, which contains the main logic. It iterates over the prompts, sends each prompt for completion via the Vertex AI Gemini API, logs the input and output to Deepchecks, and then resumes the prediction pipeline.
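
As the class docstring notes, the default PredictionHandler simply composes the three methods. Here is a quick illustration of that data flow, with the Gemini call replaced by a stub so it runs anywhere (the stub and its canned summary are purely illustrative):

```python
class StubPredictor:
    """Minimal stand-in for the Predictor above: same three-step contract,
    but the Gemini call is replaced with a canned summary."""

    def preprocess(self, prediction_input):
        # Vertex AI sends requests shaped like {"instances": [...]}
        return prediction_input["instances"]

    def predict(self, instances):
        # Stand-in for gen_model.generate_content(...) plus
        # dc_client.log_interaction(...)
        return [f"summary of: {prompt[:20]}..." for prompt in instances]

    def postprocess(self, prediction_results):
        return prediction_results

predictor = StubPredictor()
request = {"instances": ["Hello, how are you?"]}
# This is exactly how the default PredictionHandler invokes the Predictor:
completions = predictor.postprocess(predictor.predict(predictor.preprocess(request)))
print(completions)  # ['summary of: Hello, how are you?...']
```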

Next, use the Vertex AI Python SDK to build the image. With custom prediction routines, the Dockerfile is generated and the image is built for you.

import os

from google.cloud import aiplatform
from google.cloud.aiplatform.prediction import LocalModel
from src_dir.predictor import Predictor

# PROJECT_ID should hold your GCP project ID
PROJECT_ID = os.getenv('GCP_PROJECT_ID')
aiplatform.init(project=PROJECT_ID, location='us-central1')

local_model = LocalModel.build_cpr_model(
    USER_SRC_DIR,
    f"us-central1-docker.pkg.dev/{PROJECT_ID}/gemini-pro-deepchecks/deepchecks-gemini",
    predictor=Predictor,
    requirements_path=os.path.join(os.curdir, 'src_dir', 'requirements.txt'),
)

Let’s test the container locally before deploying it to a Vertex AI endpoint:

import json

with local_model.deploy_to_local_endpoint(
    os.curdir
) as local_endpoint:
    predict_response = local_endpoint.predict(
        request=json.dumps({"instances": ["Hello, how are you?"]}),
        headers={"Content-Type": "application/json"},
    )
    health_check_response = local_endpoint.run_health_check()
    print(predict_response.content)
    print(health_check_response.content)
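
The response body from the local endpoint is JSON. Assuming the server wraps results in a `predictions` field (the standard Vertex AI response shape), you can pull the completions out like this (`extract_predictions` and the sample body are illustrative):

```python
import json

def extract_predictions(raw_body):
    """Parse a Vertex AI prediction response body and return the
    predictions list (empty if the field is absent)."""
    payload = json.loads(raw_body)
    return payload.get("predictions", [])

# Example body, mirroring the standard Vertex AI response shape
sample_body = b'{"predictions": ["Hi there! I am doing well."]}'
print(extract_predictions(sample_body))  # ['Hi there! I am doing well.']
```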

Now that you’ve tested the container locally, it’s time to push the image to Artifact Registry and upload the model to Vertex AI Model Registry.

First, create an Artifact Registry repository and configure Docker to authenticate with it (REPOSITORY and REGION should hold your repository name and region, e.g. us-central1).

!gcloud artifacts repositories create {REPOSITORY} --repository-format=docker \
--location=us-central1 --description="Docker repository"

!gcloud auth configure-docker {REGION}-docker.pkg.dev --quiet

Now, push the image:

local_model.push_image()

And upload the model (MODEL_DISPLAY_NAME is a display name of your choice):

model = aiplatform.Model.upload(local_model=local_model,
                                display_name=MODEL_DISPLAY_NAME)

When the model is uploaded, you should see it in the Vertex AI console.

Now, we can deploy the model to an endpoint to support online predictions:

endpoint = model.deploy(machine_type="n1-standard-2")

passages = [
  """Esteban Cambiasso has won all the major European competitions a player can during his illustrious career but revealed that keeping Leicester City in the Premier League would be up there with the best. The Foxes are currently seven points adrift at the bottom of the table, with only eight games remaining, knowing that time is running out to save themselves. Cambiasso refuses to give up and admits that keeping Leicester up will feel like winning a trophy. Esteban Cambiasso says that helping keep Leicester in the Premier League will feel like winning a trophy 'For me, it's like another cup,' he told BBC East Midlands Today. 'When you start another season you have an objective, and this is the objective for us. 'For me, winning a cup or winning the league with another team is the same now as having the possibility to save Leicester in the Premier League.' The Argentinian midfielder poses with the trophy after his team won the 2010 FIFA Club World Cup Cambiasso had an illustrious career at Inter Milan, winning an impressive 15 trophies during his stint River Plate (2001-2002) Argentine Primera Division Real Madrid (2002-2004) La Liga Super Cup Supercopa de Espana Inter Milan (2004-2014) Champions League Serie A (5) Coppa Italia (4) Supercoppa (4) FIFA Club World Cup Having not won a game since January, Nigel Pearson's men face West Ham United on Saturday and Cambiasso is still convinced they can avoid the drop. 'I understood when I signed for Leicester it's not an easy job to stay in the Premier League,' he said. 'It's a difficult situation but I think we have our chances to win matches. There's a quarter of the Premier League left to finish. 'I think some people think for Leicester all is finished. But I'm sure, because I watch my team-mates every day, we can save Leicester and stay in the Premier League.' The former Inter Milan star signed for the Foxes in the summer, leaving Italy after ten years and embarking on a new challenge in England. 
After agreeing to a one-year-deal, Cambiasso has quickly established himself as a key player but it remains to be seen if he'll still be in the East Midlands at the start of next season. The former Real Madrid man was also successful during his short spell in Spain for Real Madrid Cambiasso played during Real's 'Galatico' era, with Luis Figo, Zinedine Zidane, Ronaldo and David Beckham 'Leicester always wanted me,' he added. 'After these nine months or eight months, I'm very happy because my family is OK, and I'm good. 'I want a few more points, but all the rest is perfect.' Cambiasso is happy in the East Midlands and could stay beyond his current one-year-deal""",
  """ MLS side Orlando City are the latest club to have expressed interest in Manchester United misfit Javier Hernandez. The Mexico international would be a huge commercial draw for the Florida-based franchise who are coached by former Everton and Manchester City striker Adrian Heath. Orlando have a huge Latin-American fanbase and made enquiries last week about the prospect of a deal. Javier Hernandez is linked with a move to Orlando City after enduring a tough time on loan at Real Madrid Orlando have a big Latin-American fanbase and Kaka is the captain of the MLS side Hernandez would be a popular arrival with Orlando supporters but eight European sides are also interested Hernandez has cut a frustrated figure during his loan spell at Real Madrid this season but still has plenty of interest from other Premier League and European sides. Southampton, Stoke, West Ham and Everton are all interested with United willing to sell for around £8million. Wolfsburg, AC Milan, Lazio and Inter Milan are also keen on the 26-year-old who has one year left on contract. United, meanwhile, have made a revised contract offer to teenage prospect Andreas Pereira. Manchester United have made a revised contract offer to 19-year-old Andreas Pereira (right) Periera (left) has a host of clubs across Europe interested in signing him if he does not agree terms at United Paris St Germain, Juventus, PSV Eindhoven and Feyenoord have all made contact with the midfielder's father Marcos after the 19-year-old rejected United's opening offer. Pereira was on the bench against Aston Villa on Saturday.""",
  """(CNN)Donald Sterling's racist remarks cost him an NBA team last year. But now it's his former female companion who has lost big. A Los Angeles judge has ordered V. Stiviano to pay back more than $2.6 million in gifts after Sterling's wife sued her. In the lawsuit, Rochelle \"Shelly\" Sterling accused Stiviano of targeting extremely wealthy older men. She claimed Donald Sterling used the couple's money to buy Stiviano a Ferrari, two Bentleys and a Range Rover, and that he helped her get a $1.8 million duplex. Who is V. Stiviano? Stiviano countered that there was nothing wrong with Donald Sterling giving her gifts and that she never took advantage of the former Los Angeles Clippers owner, who made much of his fortune in real estate. Shelly Sterling was thrilled with the court decision Tuesday, her lawyer told CNN affiliate KABC. \"This is a victory for the Sterling family in recovering the $2,630,000 that Donald lavished on a conniving mistress,\" attorney Pierce O'Donnell said in a statement. \"It also sets a precedent that the injured spouse can recover damages from the recipient of these ill-begotten gifts.\" Stiviano's gifts from Donald Sterling didn't just include uber-expensive items like luxury cars. According to the Los Angeles Times, the list also includes a $391 Easter bunny costume, a $299 two-speed blender and a $12 lace thong. Donald Sterling's downfall came after an audio recording surfaced of the octogenarian arguing with Stiviano. In the tape, Sterling chastises Stiviano for posting pictures on social media of her posing with African-Americans, including basketball legend Magic Johnson. \"In your lousy f**ing Instagrams, you don't have to have yourself with -- walking with black people,\" Sterling said in the audio first posted by TMZ. He also tells Stiviano not to bring Johnson to Clippers games and not to post photos with the Hall of Famer so Sterling's friends can see. 
\"Admire him, bring him here, feed him, f**k him, but don't put (Magic) on an Instagram for the world to have to see so they have to call me,\" Sterling said. NBA Commissioner Adam Silver banned Sterling from the league, fined him $2.5 million and pushed through a charge to terminate all of his ownership rights in the franchise. Fact check: Donald Sterling's claims vs. reality CNN's Dottie Evans contributed to this report.""",
  """(CNN)A North Pacific gray whale has earned a spot in the record books after completing the longest migration of a mammal ever recorded. The whale, named Varvara, swam nearly 14,000 miles (22,500 kilometers), according to a release from Oregon State University, whose scientists helped conduct the whale-tracking study. Varvara, which is Russian for \"Barbara,\" left her primary feeding ground off Russia's Sakhalin Island to cross the Pacific Ocean and down the West Coast of the United States to Baja, Mexico. Varvara's journey surpassed a record listed on the Guinness Worlds Records website. It said the previous record was set by a humpback whale that swam a mere 10,190-mile round trip between the \"warm breeding waters near the equator and the colder food-rich waters of the Arctic and Antarctic regions.\" Records are nice, but Bruce Mate, the lead author of the study, thinks the long trip might say more about the whale than just its ability to swim. During her 14,000-mile journey, Varvara visited \"three major breeding areas for eastern gray whales,\" which was a surprise to Mate, who is also the director of the Marine Mammal Institute at Oregon State University. \"For her to go to Mexico,\" Mate said, \"It's pretty strong evidence that it's where she's from.\" Varvara was thought to be an endangered western whale, but her ability to \"navigate across open water over tremendously long distances is impressive,\" he said in the release, which could mean that some western gray whales are actually eastern grays. With only 150 western gray whales believed to be in existence, that number might be even lower. \"Past studies have indicated genetic differentiation between the species, but this suggests we may need to take a closer look,\" Mate said. Fourth baby orca born this season""",
  """ Usain Bolt will compete at the IAAF/BTC World Relays in the Bahamas next month, the Jamaica Athletics Administrative Association has announced. The six-time Olympic gold medallist will compete at the relay championship on May 2 and 3 as part of the Jamaican team. 'I'm happy to be part of the Jamaican team for the IAAF / BTC World Relays in the Bahamas. I am fit, healthy and ready to run,' said Bolt. Usain Bolt has confirmed he will be part of Jamaica's team at the World Relays in the Bahamas Bolt reacts as he wins 4x100m gold at the London Olympic Games in 2012 'I hear the meet was a lot of fun last year and there was a great atmosphere. Jamaica has a long and successful tradition in relays and when we put on the national colours we always do our best to make the country proud,' he added. JAAA General Secretary Garth Gayle commented, 'We were extremely pleased that Usain was available for selection and that the world's fastest man will be running for Jamaica. We can expect some sprint magic on the track in the Bahamas on 2nd and 3rd May.' The full Jamaican team list for the competition will be announced shortly. Bolt insists he always does 'his best to make his country proud' while wearing Jamaica colours"""
]
for p in passages:
  endpoint.predict(instances=[p])
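
Each endpoint.predict call returns a Prediction object whose predictions attribute holds the completions. A sketch of collecting them across passages, with a stubbed endpoint so it runs without GCP credentials (the stub and its echoed summaries are assumptions; only the `.predictions` attribute mirrors the real SDK response):

```python
from collections import namedtuple

# Stub mirroring the SDK's Prediction response object
Prediction = namedtuple("Prediction", ["predictions"])

class StubEndpoint:
    """Stand-in for the deployed Vertex AI endpoint."""
    def predict(self, instances):
        # The real endpoint would return Gemini's summaries here
        return Prediction(predictions=[f"summary of: {inst}" for inst in instances])

endpoint = StubEndpoint()
passages = ["passage one", "passage two"]

summaries = []
for p in passages:
    response = endpoint.predict(instances=[p])
    summaries.extend(response.predictions)
print(summaries)  # ['summary of: passage one', 'summary of: passage two']
```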

And that’s it! You should now have a working endpoint that integrates seamlessly with Deepchecks.

Exploring the Results Within Deepchecks

Once the interactions have been sent to Deepchecks, we can explore the results.

In this example, five interactions were sent to Deepchecks’ UI, for which the system automatically calculated properties and auto-annotated the LLM’s results. We can see three good examples and two bad ones (these annotations can be overridden by a human annotator).

Exploring the Results

The properties enable a fine-grained analysis of the LLM-based app’s performance across different aspects. For a full list of properties, check out the Deepchecks docs: https://llmdocs.deepchecks.com/docs/concepts

When we dive into the data page, we can check out which interactions got a “good” score and which didn’t, and also filter or sort by any of the calculated properties:

Exploring the Results

There’s a nice explainability aspect to this screen, which presents both the “estimated annotation” and the reason why the sample was considered good or bad.

Let’s have a look at a specific example:

We can see that the summary the LLM gave for this passage does not cover all aspects, so Deepchecks gives it a score of 2 (out of 5) on the Coverage property and auto-annotates it as a bad interaction. What do you think: was the estimation correct in this case?

How to Try It Yourself

We hope this blog is detailed enough to get this combo of Deepchecks & Vertex AI up and running on your own if you have an account on both platforms. However, here is how to get started on each of the platforms if you haven’t yet:

  1. Vertex AI: Reach out to your Technical Account Manager at GCP and ask about Vertex AI custom prediction routines, or set up a free account here.
  2. Deepchecks: Request a free trial here or reach out via LinkedIn

And then, of course, follow this blog.
As usual, feel free to reach out to either team with any feedback or questions.
Happy LLM-ing!
