DEEPCHECKS GLOSSARY

LLM Output Parsing

Large language models (LLMs) are highly versatile tools that perform a wide range of tasks, including language translation, question answering, and code generation. Chat tools like ChatGPT, Claude, and Gemini have advanced conversational AI by letting users engage with a model as naturally as they would with another person. LLMs become even more useful when embedded in assistant-like tools that accept natural-language instructions and carry out tasks on the user's behalf, significantly enhancing productivity and efficiency.

However, integrating LLM output with other tools has its challenges. General-purpose LLMs are typically trained to respond conversationally, and this free-form, conversational style makes it hard to get responses in a consistent, machine-readable format.

Consider the following scenario: a restaurant's user reviews need to be analyzed, and the information extracted from each review must be fed into an analytics engine for further analysis and visualization.

If we use ChatGPT to do this, we could write a prompt like the one below.

Extract the following information from the provided restaurant review:

Determine the overall sentiment of the review (positive, negative, or neutral).
Assign a score from 1 to 10, indicating how likely the reviewer would recommend the restaurant to their friends and family based on the content of the review.
Identify and extract the key themes or aspects discussed in the review.

Review:
{review text}

Here is a sample review:

Burgers & Bites offers a mixed experience. The staff is exceptionally amicable, and the food is equally tasteful but time-consuming. Unfortunately, one will find a limited variety of vegetarian dishes and no vegan options. Parking can be tricky, especially during peak hours. Nevertheless, the quality of the burgers compensates for these shortcomings—literally.

When this review is passed to ChatGPT, it generates the following response:

  1. Overall sentiment of the review: Positive
  2. Score (1 to 10, likelihood of recommendation): 7
  3. Key themes the review talks about:
    • Staff friendliness: “The staff is exceptionally amicable”
    • Food quality: “the food is equally tasteful”
    • Service time: “time-consuming”
    • Vegetarian and vegan options: “limited variety of vegetarian dishes and no vegan options”
    • Parking difficulties: “Parking can be tricky, especially during peak hours”
    • Burger quality: “the quality of the burgers compensates for these shortcomings”

Even though this is a well-formatted response, parsing it programmatically is difficult. Moreover, even when the same review is passed to the LLM multiple times, it generates slightly different output each time, since LLM generation is not deterministic.
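To see the problem concretely, consider a naive hand-written parser for the free-form response above. This is a deliberately brittle sketch; the field labels and patterns are assumptions taken from that one sample output:

import re

def parse_review_response(text: str) -> dict:
    # These patterns assume the exact labels in the sample response above;
    # a reply worded even slightly differently will not match.
    sentiment = re.search(r"Overall sentiment.*?:\s*(\w+)", text)
    score = re.search(r"Score.*?:\s*(\d+)", text)
    themes = re.findall(r"^\s*•\s*(.+?):", text, flags=re.MULTILINE)
    if not (sentiment and score):
        raise ValueError("Response did not match the expected layout")
    return {
        "sentiment": sentiment.group(1).lower(),
        "score": int(score.group(1)),
        "themes": themes,
    }

Any change in wording, punctuation, or list style in the next response silently breaks these patterns, which is exactly the fragility the techniques below are designed to avoid.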

How can we solve it?

There are a couple of ways to solve this problem.

  1. Prompt engineering and output parsers
  2. Function/Tool calling with LLM

Why not fine-tune the model?

Fine-tuning the model can help generate a structured and consistent output. However, it limits the model’s versatility by making it highly specialized for a specific task. Additionally, fine-tuning and running a model is expensive. Since we have other effective methods to get structured responses without altering the model’s behavior, fine-tuning is unnecessary for this particular task.

Prompt Engineering and Output Parsers

Prompt engineering techniques can help LLMs generate more consistent output in a defined format. For example, take the same sentiment analysis task and extend the prompt with specific instructions on how the response should be formatted; the LLM will then likely respond in the specified format. Output parsers simplify this pattern: they combine the formatting instructions with a parser that converts the LLM's raw output into a language construct usable directly within the program. Output parsers can also validate the output against the expected format and, if there are errors, ask the LLM to fix them.

Many popular frameworks, such as LangChain and LlamaIndex, have implemented output parsers.

Let’s see how we can improve the sentiment analysis example with LangChain output parsers.

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_openai import ChatOpenAI

class Review(BaseModel):
    sentiment: str
    score: int
    themes: list[str]

parser = JsonOutputParser(pydantic_object=Review)

prompt_template = PromptTemplate(
    template="""Extract the following information from the provided restaurant review: 
    Determine the overall sentiment of the review (positive, negative, or neutral).
    Assign a score from 1 to 10, indicating how likely the reviewer would recommend the restaurant to their friends and family based on the content of the review.
    Identify and extract the key themes or aspects discussed in the review.
    \n{format_instructions}\n\n
    Review: {review}\n""",
    input_variables=["review"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
  • Define a data model: The Review class defines the structure of the expected output using pydantic.
  • Create an output parser: In this example, JsonOutputParser is used. LangChain supports multiple output parser types from which to choose. The defined model is passed to the output parser here to define a restricted schema for the output. Although defining a schema is optional for the JsonOutputParser, having a defined schema simplifies the integration process by clearly outlining the expected output format.
  • Create a prompt template: Since this program is designed to analyze multiple prompts, a reusable template is created to interact with the LLM. Format instructions generated by the parser and the input review are injected into this template through parameters.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = prompt_template | llm | parser

print(chain.invoke({"review": <Review to be analyzed>}))

When we send this prompt to an LLM (in this example, OpenAI’s gpt-3.5-turbo, but any compatible model can be used), the result is a properly structured Python dictionary matching the Review schema.

{'sentiment': 'positive',
 'score': 8,
 'themes': ['Staff friendliness',
  'Food quality',
  'Limited vegetarian options',
  'No vegan options',
  'Parking challenges during peak hours']}
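Note that JsonOutputParser returns a plain dictionary rather than a Review instance. If validated Review objects are preferred, LangChain's PydanticOutputParser is a drop-in alternative; a minimal sketch, assuming the same Review model and llm as above and a prompt template built with this parser's format instructions:

from langchain_core.output_parsers import PydanticOutputParser

# Only the parser type changes; the rest of the chain stays the same.
parser = PydanticOutputParser(pydantic_object=Review)
chain = prompt_template | llm | parser

result = chain.invoke({"review": "<Review to be analyzed>"})
print(result.sentiment, result.score)  # attribute access on a Review instance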

Following is the prompt sent to the LLM; we can see that additional instructions have been appended to the original prompt.

Extract the following information from the provided restaurant review:

Determine the overall sentiment of the review (positive, negative, or neutral).

Assign a score from 1 to 10, indicating how likely the reviewer would recommend the restaurant to their friends and family based on the content of the review.

Identify and extract the key themes or aspects discussed in the review.

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:

{"properties": {"sentiment": {"title": "Sentiment", "type": "string"}, "score": {"title": "Score", "type": "integer"}, "themes": {"title": "Themes", "type": "array", "items": {"type": "string"}}}, "required": ["sentiment", "score", "themes"]}

Review: <INPUT REVIEW>

Using output parsers is a flexible method to ensure LLMs consistently respond in the expected format. However, errors can still occur, so handling them properly is essential.

One way to handle errors is to wrap the parsing code in a try/except block and handle the exception. LangChain also offers additional output parsers that wrap around an existing output parser to handle errors; both approaches are combined in the sketch after the list below.

  • OutputFixingParser (https://python.langchain.com/v0.1/docs/modules/model_io/output_parsers/types/output_fixing/): If the output is misformatted, this parser makes another LLM call asking the model to fix the error.
  • RetryOutputParser (https://python.langchain.com/v0.1/docs/modules/model_io/output_parsers/types/retry/): This parser retries the request when a parsing error occurs.
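Here is a minimal sketch combining both approaches, assuming the parser and llm objects defined in the earlier example (the truncated raw_output string is a stand-in for a malformed LLM response):

from langchain.output_parsers import OutputFixingParser
from langchain_core.exceptions import OutputParserException

# Wrap the original JSON parser; on a parse failure, OutputFixingParser
# makes one extra LLM call asking the model to repair the output.
fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=llm)

raw_output = '{"sentiment": "positive", "score": 8, "themes": ["Staff friendliness"'  # malformed JSON
try:
    print(fixing_parser.parse(raw_output))
except OutputParserException as e:
    # Reached only if the repair call also fails to produce valid JSON.
    print(f"Could not recover structured output: {e}")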

Function/Tool Calling with LLM

Function calling is a capability provided by the LLM itself. From the user prompt, the model determines whether a predefined function needs to be called and generates output matching the function’s input requirements as a JSON response. Many popular LLMs, such as gpt-3.5-turbo, gpt-4, gemini-1.0, claude-3, and llama-3, support some form of function calling.

Let’s implement the sentiment analysis sample with tool calling.

from openai import OpenAI

client = OpenAI()

# The review text goes in as an ordinary user message.
messages = [{"role": "user", "content": "<Review to be analyzed>"}]

# Tool definition: a JSON Schema describing the function and its arguments.
tool_def = {
    "type": "function",
    "function": {
        "name": "analyse_review",
        "description": "Analyse the review sentiment and themes with recommendation score",
        "parameters": {
            "type": "object",
            "properties": {
                "sentiment": {
                    "type": "string",
                    "enum": ["positive", "negative", "neutral"],
                    "description": "Overall sentiment of the review",
                },
                "score": {
                    "type": "integer",
                    "description": "Score from 1 to 10 indicating how likely the reviewer would recommend the restaurant based on the review content",
                    "minimum": 1,
                    "maximum": 10
                },
                "themes": {
                    "type": "array",
                    "items": {
                        "type": "string",
                    },
                    "description": "Key themes or aspects discussed in the review"
                }
            },
            "required": ["sentiment", "score", "themes"]
        }
    }
}

# Force the model to call analyse_review so the arguments always come back structured.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    tools=[tool_def],
    tool_choice={"type": "function", "function": {"name": "analyse_review"}},
)
print(response.choices[0].message.tool_calls[0].function.arguments)

In the above code, we define the structure of our function and its arguments using a JSON schema, then include this tool definition alongside the messages in the LLM call. The tool_choice parameter specifies how the model should use the defined tool; here, we explicitly instruct it to call the defined function. If you prefer the model to decide on its own, set tool_choice to "auto".

This code also outputs the result in the expected structure.

{
   "sentiment":"positive",
   "score":8,
   "themes":[
      "friendly staff",
      "tasteful food",
      "limited vegetarian options",
      "no vegan options",
      "time-consuming food",
      "limited variety",
      "tricky parking",
      "quality burgers"
   ]
}
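Note that function.arguments is returned as a JSON-encoded string, not a Python object, so one explicit parsing step is still needed. A minimal sketch, reusing the response from the call above:

import json

# arguments is a JSON string; json.loads converts it into a dict.
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(args["sentiment"], args["score"], args["themes"])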

This is another method to get the LLM to generate a structured output based on user specifications. Since the model itself provides this functionality, usage may vary slightly depending on the model provider.

Conclusion

Integrating LLM output with other tools can greatly enhance the effectiveness of AI-powered solutions. Techniques like prompt engineering, output parsers, and function/tool calling enable the generation of structured and consistent responses from LLMs, bypassing the need for costly fine-tuning. By leveraging frameworks like LangChain or built-in function calling capabilities, developers can create robust applications that transform LLM outputs into actionable data.