The release of ChatGPT and its rapid adoption by over 100 million users in its first two months clearly showed the potential of Large Language Models (LLMs) to the world. For the developer community, foundation models such as GPT-4, PaLM, LLaMA, and DALL·E have also changed how AI-enabled applications are built and deployed. The rise of LLMs has spurred developer activity, with tens of thousands of developers building LLM applications, including chatbots, document summarization, question answering, search, and generative writing tasks such as articles and essays.

While the impact of foundation models is huge, it is important to critically evaluate their quality for each use case. As a result, building LLM applications typically involves significant experimentation. For example, a developer building a question-answering application might test it by providing a series of questions and manually reviewing the answers.

If the answers are inaccurate, the developer adjusts the prompts, tunes hyperparameters, or fine-tunes the LLM, and then re-tests the new revision. However, this manual process is time-consuming and repetitive, and only a few useful tools are available to measure app performance and quality metrics. It is also challenging to track how those metrics change with each revision. This is where TruLens fits into the picture.

What is TruLens?

TruLens aims to fill a gap in the LLMOps stack by supporting developers in evaluating and tracking LLM experiments. Through a novel abstraction called “feedback functions,” TruLens enables programmatic evaluation of the quality of the inputs, outputs, and intermediate results of LLM applications. It ships with out-of-the-box feedback functions and also lets developers build custom functions tailored to their applications. TruLens integrates with developer frameworks such as LlamaIndex and LangChain, making it easy to add to existing apps.

The following image illustrates the workflow for rapid LLM app development and evaluation using TruLens.


Image Credits – Truera

After building an LLM application, TruLens can be connected to the app to begin recording logs. Subsequently, feedback functions can be configured to log and evaluate the quality of the LLM app. You can visualize the trend of evaluation results in the TruLens dashboard, allowing easier selection of the best LLM chain version suitable for the application.
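The record-evaluate-select loop described above can be sketched in plain Python. This is an illustrative stand-in, not the actual TruLens API: `run_app`, the canned answers, and the toy `relevance` scorer are all hypothetical names invented for the sketch.

```python
# Illustrative sketch of the record -> evaluate -> compare workflow.
# (Hypothetical names; not the TruLens API.)
from statistics import mean

def run_app(version: str, prompt: str) -> str:
    """Stand-in for an LLM app; returns a canned answer per chain version."""
    answers = {
        "v1": "Paris is the capital",
        "v2": "The capital of France is Paris",
    }
    return answers[version]

def relevance(prompt: str, response: str) -> float:
    """Toy feedback function: fraction of prompt words echoed in the response."""
    p = set(prompt.lower().split())
    r = set(response.lower().split())
    return len(p & r) / len(p)

prompts = ["what is the capital of france"]
scores = {}
for version in ("v1", "v2"):
    # 1. record each (prompt, response) pair produced by this app version
    records = [(q, run_app(version, q)) for q in prompts]
    # 2. evaluate the records with the feedback function and aggregate
    scores[version] = mean(relevance(q, a) for q, a in records)

# 3. select the best-scoring chain version
best = max(scores, key=scores.get)
```

In a real setup the dashboard plays the role of step 3, letting you inspect the score trend per version instead of picking the maximum programmatically.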

Feedback Functions

Feedback functions provide a new method for augmenting human feedback. A feedback function generally takes LLM-generated text, along with its metadata, as input and returns a score. Feedback functions can be built from simple rule-based systems, discriminative machine learning models such as those used for sentiment analysis, explicit human feedback (e.g., thumbs up/thumbs down), or even another LLM. The following example shows a feedback function in action.
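The simplest of these options, a rule-based feedback function, can be written in a few lines. The sketch below is illustrative only: the word lists are hypothetical, and a production sentiment function would use a trained model rather than keyword counting.

```python
# A minimal rule-based feedback function: text in, score in [0, 1] out.
# (Illustrative only; the cue-word lists are hypothetical.)
POSITIVE = {"great", "helpful", "correct", "clear"}
NEGATIVE = {"wrong", "unhelpful", "confusing", "incorrect"}

def sentiment_feedback(response: str) -> float:
    """Score an LLM response from 0 (negative) to 1 (positive)."""
    words = response.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.5  # neutral when no cue words are found
    return pos / (pos + neg)
```

For example, `sentiment_feedback("that was clear and helpful")` returns 1.0, while `sentiment_feedback("the answer is wrong")` returns 0.0.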


Image Credits – Truera

Once the functions are created, they can be used to test and evaluate models during development. They can also be integrated into the application’s inference path to enable monitoring at scale. This makes it possible to segment and track LLM performance across different LLM versions, as shown below.


Image Credits – Johannes Jolkkonen YouTube

Let’s look at some of the out-of-the-box feedback functions and their uses:

  • Language match: This function checks whether the response is in the same language as the prompt, which users naturally expect. Behind the scenes, it calls a Hugging Face API to programmatically check for a language match.
  • Response relevance: This function is useful for checking how relevant the response is to the prompt by utilizing an OpenAI LLM that is prompted to generate a relevance score. There’s also a variation of this function that enables chain-of-thought (CoT) reasoning, which outputs the reasons behind the score produced.
  • Context relevance: Similar to the response relevance function, this uses an OpenAI chat completion model to check how relevant the retrieved context is to the question. This function also provides a CoT reasoning variant.
  • Groundedness: This function validates if the answer is grounded in its provided source content and uses an LLM provider to achieve this. This is useful for evaluating whether the developed LLM application is hallucinating or staying true to its source material.
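To make the groundedness idea concrete, here is a toy token-overlap version of the check. It is a stand-in for the LLM-based TruLens function: the `grounded` name, the sentence-splitting heuristic, and the 0.5 threshold are all assumptions made for illustration.

```python
# Toy groundedness check: each answer sentence must share enough words
# with the source text. (A stand-in for the LLM-based TruLens function.)
def grounded(answer: str, source: str, threshold: float = 0.5) -> bool:
    src_words = set(source.lower().split())
    for sentence in answer.split("."):
        words = set(sentence.lower().split())
        if not words:
            continue  # skip empty fragments from trailing periods
        overlap = len(words & src_words) / len(words)
        if overlap < threshold:
            return False  # sentence not supported by the source
    return True

SOURCE = "the eiffel tower is in paris and was completed in 1889"
```

Against this source, `grounded("the eiffel tower is in paris", SOURCE)` passes, while a fabricated claim like `grounded("napoleon built it in 1905", SOURCE)` fails, which is the kind of hallucination the real function is designed to flag.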

The full list of stock feedback functions can be found here. TruLens’s ultimate goal is to enable developers to take advantage of such feedback functions to evaluate LLM applications at scale. The ability to quantitatively assess and evaluate each version of an LLM enhances the LLMOps workflow, ultimately helping to produce more relevant, high-quality LLM applications.

TruLens Cost

One practical consideration when using TruLens is cost. Many feedback functions rely on invoking other LLMs over paid APIs, so it is important to check which ones your workflow uses. When using such functions, a cost-benefit analysis is needed to decide whether the added benefit of detecting accuracy and fairness issues is worth the expense. A few ways to manage the cost include:

  • Only using feedback functions that rely on free implementations available from OpenAI and Hugging Face.
  • In some cases, a comprehensive suite of lower-cost feedback functions can provide deeper insight than a single higher-quality one.
  • Over time, more feedback functions that use low-cost mechanisms, such as previous-generation BERT-style foundation models and simpler rule-based systems, are expected to be added to TruLens, which can further reduce costs.
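A back-of-the-envelope estimate helps with the cost-benefit analysis above. All of the numbers in the sketch below are hypothetical placeholders; substitute your provider's current per-token pricing and your app's real traffic and feedback configuration.

```python
# Back-of-the-envelope cost of LLM-based feedback evaluation.
# All constants are hypothetical placeholders, not real prices.
PRICE_PER_1K_TOKENS = 0.002      # assumed USD price per 1,000 tokens
TOKENS_PER_FEEDBACK_CALL = 500   # prompt + response + scoring instructions
CALLS_PER_RECORD = 3             # e.g. relevance, context relevance, groundedness

def monthly_feedback_cost(records_per_day: int, days: int = 30) -> float:
    """Estimated monthly USD spend on LLM-backed feedback calls."""
    calls = records_per_day * days * CALLS_PER_RECORD
    tokens = calls * TOKENS_PER_FEEDBACK_CALL
    return tokens / 1000 * PRICE_PER_1K_TOKENS
```

Under these assumptions, evaluating 1,000 records per day with three LLM-backed feedback functions costs roughly $90 per month, which is the scale of spend the cheaper rule-based and BERT-style alternatives can offset.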