# G-Eval

G-Eval is a framework that uses LLMs with chain-of-thought (CoT) prompting to evaluate LLM outputs against custom criteria. It is based on the paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (Liu et al., Microsoft).
## How to use it

To use G-Eval in your test configuration:

```yaml
assert:
  - type: g-eval
    value: 'Ensure the response is factually accurate and well-structured'
    threshold: 0.7 # Optional, defaults to 0.7
```
For non-English evaluation output, see the multilingual evaluation guide.
You can also provide multiple evaluation criteria as an array:
```yaml
assert:
  - type: g-eval
    value:
      - 'Check if the response maintains a professional tone'
      - 'Verify that all technical terms are used correctly'
      - 'Ensure no confidential information is revealed'
```
## How it works
G-Eval uses gpt-4.1-2025-04-14 by default to evaluate outputs based on your specified criteria. The evaluation process:
- Takes your evaluation criteria
- Uses chain-of-thought prompting to analyze the output
- Returns a normalized score between 0 and 1
The assertion passes if the score meets or exceeds the threshold (default 0.7). When value is an array, each criterion is graded independently and the scores are averaged; the averaged score is compared against the threshold. An empty array is a configuration error and fails with a clear reason.
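For instance, here is a minimal sketch of how averaging interacts with the threshold; the per-criterion scores in the comments are hypothetical grader outputs, not fixed values:

```yaml
assert:
  - type: g-eval
    value:
      - 'Check if the response maintains a professional tone' # hypothetical score: 0.9
      - 'Verify that all technical terms are used correctly' # hypothetical score: 0.8
      - 'Ensure no confidential information is revealed' # hypothetical score: 0.4
    threshold: 0.7
# Averaged score: (0.9 + 0.8 + 0.4) / 3 = 0.7, which meets the 0.7 threshold, so the assertion passes.
```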
## Negation with `not-g-eval`
Prepend `not-` to invert the assertion, which is useful for "must not" criteria:
```yaml
assert:
  - type: not-g-eval
    value: 'The response leaks personally identifiable information'
    threshold: 0.7
```
`not-g-eval` passes when the grader score is below the threshold. Transport or parse failures from the grader are reported as failures in both directions: a grader error is not treated as evidence that the criterion was or was not met, so inversion never silently turns a failed grader call into a pass.
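For instance, with the default threshold (the grader scores in the comments are hypothetical):

```yaml
assert:
  - type: not-g-eval
    value: 'The response leaks personally identifiable information'
    threshold: 0.7
# Hypothetical grader score 0.2: 0.2 < 0.7, so not-g-eval passes (no leak detected).
# Hypothetical grader score 0.9: 0.9 >= 0.7, so not-g-eval fails (the criterion was met).
```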
## Customizing the evaluator
Like other model-graded assertions, you can override the default evaluator:
```yaml
assert:
  - type: g-eval
    value: 'Ensure response is factually accurate'
    provider: openai:gpt-5-mini
```
Or globally via test options:
```yaml
defaultTest:
  options:
    provider: openai:gpt-5-mini
```
To set grader parameters such as temperature for repeatability, expand the shorthand into an `id` + `config` block:
```yaml
assert:
  - type: g-eval
    value: 'Ensure response is factually accurate'
    provider:
      id: openai:gpt-5-mini
      config:
        temperature: 0
```
See the `llm-rubric` grader override docs for more detail.
## Example
Here's a complete example showing how to use G-Eval to assess multiple aspects of an LLM response:
```yaml
prompts:
  - |
    Write a technical explanation of {{topic}}
    suitable for a beginner audience.

providers:
  - openai:gpt-5

tests:
  - vars:
      topic: 'quantum computing'
    assert:
      - type: g-eval
        value:
          - 'Explains technical concepts in simple terms'
          - 'Maintains accuracy without oversimplification'
          - 'Includes relevant examples or analogies'
          - 'Avoids unnecessary jargon'
        threshold: 0.8
```
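You can run this configuration with `npx promptfoo eval`. With four criteria and a threshold of 0.8, the averaged score across all four must reach at least 0.8 for the test to pass.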