The Best LLM Safety-Net to Date: Deepchecks, Garak, and NeMo Guardrails All in One Bundle

Introduction

AI is the new electricity,” said Andrew NG back in 2017. For a few years, people liked to say this but treated it as an overstatement, meant primarily to create buzz during lectures and discussions. But ever since the LLM revolution – I’m not so sure about that. LLMs rapidly changed the way we approach some of the toughest problems out there like summarization, content generation, and question-answering – lowering the barrier for building AI systems to virtually 0 in many cases.

However, LLM-based applications come with some significant risks: They can leak private information, use inappropriate language, and in some cases initiate actions that are flat-out malicious. Well, electricity came with the risk of being electrocuted – a risk Benjamin Franklin helped us all mitigate by introducing the idea of electrical grounding. Hopefully, over time we’ll arrive at an approach for mitigating LLM risks that feels just as elegant as electrical grounding.

AI is the new electricity

Generic illustration generated by… (You’ll never guess)… GPT-4!

In this blog post, we’ll introduce an approach to doing just that, i.e mitigating risks by an LLM-based system throughout the different phases of the lifecycle – by combing the upsides of several powerful tools:

  • Garak: An open-source stress-testing tool that identifies vulnerabilities and pushes LLMs to their limits, exposing potential weaknesses before they become real-world problems.
  • Deepchecks: An LLM evaluation platform that has a significant focus on “quality checks”, but that is also capable of detecting & flagging malicious prompts designed to exploit LLMs (e.g. make them generate harmful content, leak sensitive information, etc).
  • NeMo Guardrails (by Nvidia): A framework designed to implement specific safeguards and boundaries so that they can augment the LLM-based app’s responses in real time, effectively guiding LLMs toward responsible and ethical use.

Stress Testing with Garak

Garak (https://github.com/leondz/garak) – an LLM vulnerability scanner and stress-testing tool, provides a comprehensive set of potential weaknesses in LLMs like hallucination, data leakage, prompt injection, misinformation, toxicity generation, jailbreaks, and many other. If you know nmap, it’s nmap for LLMs.
Using Garak we can test almost any LLM-based application. In this blogpost we will focus on a product summarization use-case with the following system prompt:

Summarize the text below and describe the product based on the description and user-based queries. Ensure the summary is in a paragraph.
Text: ###{text}###

We will send several types of malicious prompts to the app, and check how it behaves.
For this, we will use Mixtral 8x7B hosted on Nvidia NeMo. NeMo hosts a lot of optimized foundation models and is really easy to use. In the next blog post of this series, we will show how to run Mixtral on top of a Triton server, using the TensorRT-LLM backend.
In Garak, there is support for various types of models or generators (in the garak terminology). In the tutorial we will use the Function generator.

import requests

def invoke(prompt)
  invoke_url = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/8f4118ba-60a8-4e6b-8574-e38a4067a4a3"
  fetch_url_format = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/"
  
  headers = {
      "Authorization": "Bearer $API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC",
      "Accept": "application/json",
  }
  
  payload = {
    "messages": [
      {
        "content": prompt,
        "role": "user"
      }
    ],
    "temperature": 0.2,
    "top_p": 0.7,
    "max_tokens": 1024,
    "seed": 42,
    "stream": False
  }
  
  # re-use connections
  session = requests.Session()
  
  response = session.post(invoke_url, headers=headers, json=payload)
  
  while response.status_code == 202:
      request_id = response.headers.get("NVCF-REQID")
      fetch_url = fetch_url_format + request_id
      response = session.get(fetch_url, headers=headers)
  
  response.raise_for_status()
  response_body = response.json()
  completion = response_body['choices'][0]['message']['content']
  
  # Send the interaction to Deepchecks LLM platform for evaluation
  send_to_deepchecks(prompt, completion)
  return completion

Let’s define the send_to_deepchecks function and have a closer look at what it does:

from deepchecks_llm_client.client import dc_client
from deepchecks_llm_client.data_types import EnvType
dc_client.init(
            host=os.getenv('HOST'),
            api_token=os.getenv('API_TOKEN'),
            app_name=os.getenv('APP_NAME'),
            version_name='1',
            env_type=EnvType.EVAL,
            auto_collect=False
)

def send_to_deepchecks(user_input, output):
  dc_client.log_interaction(input=user_input, output=output)

Now, we will use Garak to run “probes” on this model.

from garak.harnesses.probewise import ProbewiseHarness
from garak.generators.function import Single
from garak._plugins import enumerate_plugins
from garak.evaluators import ThresholdEvaluator
from garak import __version__
from garak import _config
from garak.command import start_run

start_run()
evaluator = ThresholdEvaluator()
harness = ProbewiseHarness()
plugin_names = enumerate_plugins(category="probes")
model = Single(invoke, generations=1, probes=probes)
harness.run(model, probes, evaluator)
Deepchecks For LLM VALIDATION

The Best LLM Safety-Net to Date: Deepchecks, Garak, and NeMo Guardrails All in One Bundle

  • Reduce Risk
  • Simplify Compliance
  • Gain Visibility
  • Version Comparison
TRY LLM VALIDATION

Combining with Deepchecks LLM Evaluation Solution

Now, we can view the results in the Deepchecks UI:

Combining with Deepchecks LLM Evaluation Solution

Deepchecks UI after testing Garak’s “probes” on an LLM-based app.

Deepchecks evaluates the weaknesses of the LLM app using Garak’s detectors and built-in algorithms. It aggregates the scores from the different detectors and concludes whether the app is vulnerable to each type of attack.

These scores can help understand the weaknesses of the LLM-based app and will aid in defining the necessary guardrails.

Of course, Deepchecks has many other capabilities related to quality assessment (relevance, grounded in context, etc), and not only to the risk assessment aspect. But that’s a topic for another blog 😊.

Defining Guardrails

Now, after we gather the weaknesses of our model, we can effectively use Deepchecks and NeMo Guardrails to define specific rails on our app. NeMo Guardrails is a comprehensive framework that helps safeguard LLM-based applications.

Programmable Guardrails

Diagram of the different high-level flows through NeMo’s programmable guardrails. You can see the different types of rails in the green rectangles.

It consists of multiple user-defined rails, which specify the expected behavior of the model. The framework makes sure that the LLM app will not respond “beyond” its defined guardrails. The configuration of the rails is pretty straightforward and a nice tutorial exists in the docs.

Note: This configuration, although simple, still seems to be a barrier to getting started for many teams. That’s part of the motivation for enabling Deepchecks to streamline it even more.

An example config.yml of a rail may look like this:

# config.yml
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
rails:
  # Input rails are invoked when new input from the user is received.
  input:
    flows:
      - self check pii
  # Output rails are triggered after a bot message has been generated.
  output:
    flows:
      - self check hallucination

This file mainly defines general configurations like the model, and the active rails.

In addition, we need to define the rails themselves, using the CoLang syntax:

define flow
  user ask about people
  $hallucination_warning = True
  bot respond about people

define bot inform answer prone to hallucination
  "The previous answer is prone to hallucination and may not be accurate."

define subflow self check hallucination
  if $hallucination_warning == True
    $hallucination_warning = False
    $hallucinated = execute self_check_hallucinations
    if $hallucinated
      bot inform answer prone to hallucination
      stop

define user tell pii
  "ID"
  "Password"
  "Credit Card"

define subflow self check pii
  $allowed = execute self_check_pii
  if not $allowed
    bot refuse to respond
    stop

define flow
  user tell pii
  $facts_check = True

And the prompts.yml file:

prompts:
  - task: self_check_pii
    content: >
      Instruction: {{ user_input }}
      
      Would this instruction contains any potential PII data? Answer only with Yes/No.
      
      Answer [Yes/No]:

  - task: self_check_hallucinations
    content: >
      Instruction: {{ user_input }}
      
      You are given a task to identify if the hypothesis is in agreement with the context below.
      You will only use the contents of the context and not rely on external knowledge.
      Answer with yes/no. "context": {{ paragraph }} "hypothesis": {{ statement }} "agreement":
      Answer [Yes/No]:

As it can be inferred, the options are endless! but, instead of writing those rails manually, they can be automatically generated by Deepchecks, correlated with the weaknesses found earlier!

Deepchecks can support multiple LLM guardrail formats, with a current focus on llm-guard and NeMo Guardrails. In this blog post the focus is on Nvidia guardrails, so all we need to do is to download the configured rails in the Nvidia guardrail format:

LLM Guard

GIF from the Deepchecks UI. After finding the weaknesses detected by garak, you can use the built-in capability of downloading the rails.

Now, after putting the downloaded files in place, we will run the Mixtral 8x7B deployed on NeMo, using the guardrails library.

In the downloaded config.py file, we will modify the model section:

# config.yml
models:
  - type: main
    engine: nemollm
    model: mixtral-8x7b
    parameters:
      api_key: <NEMO API key>
      organization_id: <NEMO Org ID>

rails:
  # ...
  # Downloaded from Deepchecks
  # ...

and then run the model:

from nemoguardrails import LLMRails, RailsConfig

# Load a guardrails configuration from the specified path.
config = RailsConfig.from_path("PATH/TO/CONFIG")
rails = LLMRails(config)

completion = rails.generate(
    messages=[{"role": "user", "content": "Hello world!"}]
)

Please note: This is the most simplistic way of running guardrails, and the framework also supports async & streaming generations, in addition to a built-in server.

Conclusion

In this blog post, we explored the importance of LLM security and how a multi-layered approach can mitigate potential risks. We showcased a powerful combination of three tools: Garak, Deepchecks, and NeMo Guardrails. Each tool plays a crucial role in the security process:

  • Garak proactively stress tests LLMs, identifying weaknesses before they become real-world problems.
  • Deepchecks provides comprehensive evaluation (including automatic scoring), version comparison, flagging malicious prompts, and safeguarding LLMs from harmful outputs.
  • NeMo Guardrails acts as a framework for implementing specific safeguards and boundaries, ensuring responsible and ethical LLM usage.

What we’ve shown in this blog is how to start with “pen-testing” in the development/iteration phase using Deepchecks & Garak, and utilize the insights from this phase to easily set up NeMo Guardrails for an LLM-based app use case.

By combining these tools we can ensure the safe and beneficial utilization of LLMs with a pretty straightforward (and easy) process.

😊 Happy (and safe) LLMing!

Good Luck

Deepchecks For LLM VALIDATION

The Best LLM Safety-Net to Date: Deepchecks, Garak, and NeMo Guardrails All in One Bundle

  • Reduce Risk
  • Simplify Compliance
  • Gain Visibility
  • Version Comparison
TRY LLM VALIDATION

Recent Blog Posts

LLM Evaluation: When Should I Start?
LLM Evaluation: When Should I Start?
How to Build, Evaluate, and Manage Prompts for LLM
How to Build, Evaluate, and Manage Prompts for LLM