G-Eval – Evaluate NLG with Human-like Reasoning

Natural Language Generation (NLG) is a software process powered by AI that allows computers to generate text based on various structured and unstructured data. When using NLG, the output is in a language that humans can understand, allowing computers to communicate with us efficiently.

As powerful as these tools may be, they have their limits. If you are familiar with ChatGPT or similar NLG tools, you have most likely come across problems such as receiving correct but completely irrelevant information, or information that was hallucinated. These models must therefore be properly tested and evaluated to minimize the risk of bad responses. Here are several widely used evaluation metrics:

  • Statistical methods: BLEU, ROUGE, METEOR
  • Model-based methods: NLI, BLEURT, G-Eval
  • Combination of statistical and model-based concepts: BERTScore, MoverScore

Let’s look at G-Eval, a model-based evaluation framework that uses an LLM to evaluate generated output. G-Eval uses a powerful model like GPT-4 to evaluate generated text against any given criteria. What sets this method apart is that it aims to achieve a higher correlation with human judgments than other methods. It also allows us to define custom evaluation metrics, making it well suited to task-specific evaluation. For example, a generated summary can be judged on conciseness, relevancy, coherence, or any custom metric. The G-Eval process can be broken into two parts.

  • Introduce the task and clearly define the evaluation criteria. This information will be used to generate a set of evaluation steps.
  • Give the input, generated text, context, and evaluation steps to the LLM. Ask it to generate a score from 1 to 5 (where 5 is best).

Let’s go through these steps in more detail using an example. Consider an NLG task where a summary of an article has been generated, and we need to evaluate this summary using G-Eval.

Task and Evaluation Criteria

In the first stage, we provide a prompt to the LLM selected as the evaluator, asking it to generate step-by-step instructions for the evaluation. The prompt must contain information about the task, such as what text is being evaluated, the criteria it will be judged on, and what type of input was given.

For our example, we could provide this prompt:

“You will be given one summary written for an article. Your task is to rate this summary based on one metric.”

In addition, we need to describe the required metric in detail. These evaluation criteria can be customized for our task. For example, we can evaluate the generated summary’s conciseness, coherence, and grammar, and whether it contains all the relevant information.

  • Coherence (1-5): the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence, whereby “the summary should be well structured and organized. The summary should not just be a heap of related information but build from sentence to sentence to a coherent body of information about a topic.”
  • Relevancy (1-5): the degree to which the content of the text is directly related to the given topic or query. We align this dimension with the necessity for the summary to address the key points of the prompt. The summary should stay focused on the topic, ensuring that every sentence contributes relevant information and avoiding any unrelated details.
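As a sketch, the task description and a criterion can be assembled into the first-stage prompt with a small helper. The function and variable names below are illustrative, not part of G-Eval itself:

```python
def build_criteria_prompt(task, name, description):
    """Assemble the first-stage G-Eval prompt from the task description
    and a single evaluation criterion (illustrative helper)."""
    return (
        f"{task}\n\n"
        "Evaluation Criteria:\n"
        f"{name} (1-5): {description}\n"
    )

prompt = build_criteria_prompt(
    "You will be given one summary written for an article. "
    "Your task is to rate this summary based on one metric.",
    "Coherence",
    "the collective quality of all sentences. The summary should be well "
    "structured and build from sentence to sentence into a coherent body "
    "of information about the topic.",
)
print(prompt)
```

This prompt is what the evaluator LLM receives before being asked to produce the evaluation steps.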



Generation of the Evaluation Steps

Chain of thought (CoT) prompting is an approach that encourages LLMs to break down a complex thought into intermediate steps. In this method, the prompts guide the model through a step-by-step reasoning process. Thus, when given a complex task, the model will know how to break it down into smaller steps. CoT prompting has been shown to enhance LLM performance on tasks involving complex arithmetic and commonsense reasoning.

G-Eval also uses CoT prompting to generate steps for the evaluation process, and this is a key factor in making G-Eval correlate better with human reasoning. It takes the information about the task and criteria and generates a step-by-step plan mimicking the detailed, logical approach we humans use when evaluating text. For our example, evaluating coherence might look as follows:

  • Read the article carefully and identify the main topic and key points.
  • Read the summary and compare it to the article. Check if the summary covers the article’s main topic and key points and if it presents them in a clear and logical order. This step is automatically created by the LLM to evaluate the coherence of the summary.
  • Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest, based on the evaluation criteria.
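Putting the pieces together, the generated evaluation steps can be injected into the final scoring prompt along with the article and the summary. A minimal sketch, with illustrative names and wording:

```python
# Evaluation steps as the LLM might generate them for the coherence metric
EVALUATION_STEPS = [
    "Read the article carefully and identify the main topic and key points.",
    "Read the summary and compare it to the article, checking that it covers "
    "the main topic and key points in a clear and logical order.",
    "Assign a coherence score from 1 (lowest) to 5 (highest) based on the "
    "evaluation criteria.",
]

def build_scoring_prompt(article, summary):
    """Combine the evaluation steps with the input article and the
    generated summary into the final prompt for the evaluator LLM."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(EVALUATION_STEPS, 1))
    return (
        f"Evaluation Steps:\n{steps}\n\n"
        f"Source Article:\n{article}\n\n"
        f"Summary:\n{summary}\n\n"
        "Coherence score (1-5):"
    )

print(build_scoring_prompt("the article text", "the generated summary"))
```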

Scoring Function

Finally, a scoring function calls the LLM with the designed prompt, the automatically generated CoT steps, the input, and the generated text to be evaluated. In our example, the call combines the initial prompt, the CoT steps, the input article, and the generated summary, asking the LLM to give a score between 1 and 5.

It should be noted that in the final stage, relying on a single generated textual answer for the score is not advisable. The following issues can occur:

  1. In some scenarios, a single score (like ‘3’) is predicted much more often when evaluating, leading to low variance and poor correlation with human judgment.
  2. LLMs tend to output integer scores. Thus, when comparing generated samples for the same task, there are many ties, and the scores do not capture subtle differences.


To mitigate these issues, two approaches can be used.

Option 1: The probabilities of the output score tokens can be obtained from the LLM. The final score is then computed as a summation of the possible scores weighted by their probabilities.
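As a sketch of this option, suppose the evaluator assigns the following (hypothetical) probabilities to the score tokens; the weighted summation yields a continuous score instead of a bare integer:

```python
# Hypothetical probabilities the evaluator LLM assigns to score tokens 1-5
token_probs = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.45, 5: 0.20}

# Probability-weighted summation gives a fine-grained final score
final_score = round(sum(score * p for score, p in token_probs.items()), 2)
print(final_score)  # 3.65
```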

Option 2: There are many cases where LLMs do not expose output token probabilities. In such cases, the same evaluation can be run several times and the resulting scores averaged.
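A sketch of the second option, averaging several independent runs (the scores here are hypothetical):

```python
# Hypothetical integer scores from five repeated runs of the same evaluation
sampled_scores = [4, 3, 4, 5, 4]

# Averaging breaks ties between candidates that would all receive a 4
final_score = sum(sampled_scores) / len(sampled_scores)
print(final_score)  # 4.0
```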

Using these methods, it is possible to obtain scores sensitive to subtle differences and qualities in text.


G-Eval is a powerful and versatile framework that can greatly enhance the evaluation of LLMs and NLG systems. It is flexible enough to use any custom metric we describe and follows a step-by-step evaluation process, much like a human.

Using G-Eval in Code

G-Eval is available to use as a part of DeepEval, an open-source framework for evaluating LLMs.
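As a brief illustration, a coherence metric can be defined with DeepEval's `GEval` class roughly as follows. The criteria text and the placeholder article and summary are illustrative, and running the evaluation requires an evaluator model (such as GPT-4) configured with an API key:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define a custom coherence metric; the criteria text is illustrative
coherence = GEval(
    name="Coherence",
    criteria=(
        "Coherence (1-5): the collective quality of all sentences. "
        "The summary should build from sentence to sentence into a "
        "coherent body of information about the topic."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Placeholder texts standing in for a real article and its generated summary
article_text = "..."
summary_text = "..."
test_case = LLMTestCase(input=article_text, actual_output=summary_text)

coherence.measure(test_case)  # calls the evaluator LLM under the hood
print(coherence.score, coherence.reason)
```

DeepEval generates the evaluation steps from the criteria automatically, following the two-stage process described above.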

Link: DeepEval Docs


G-Eval was introduced in the paper “G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment”.

Link: Research Article