# G-Eval

G-Eval is a framework that uses LLMs with chain-of-thought (CoT) prompting to evaluate LLM outputs against custom criteria. It is based on the paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (Liu et al., Microsoft).
## How to use it

To use G-Eval in your test configuration:

```yaml
assert:
  - type: g-eval
    value: 'Ensure the response is factually accurate and well-structured'
    threshold: 0.7 # Optional, defaults to 0.7
```
For non-English evaluation output, see the multilingual evaluation guide.
You can also provide multiple evaluation criteria as an array:
```yaml
assert:
  - type: g-eval
    value:
      - 'Check if the response maintains a professional tone'
      - 'Verify that all technical terms are used correctly'
      - 'Ensure no confidential information is revealed'
```
## How it works
G-Eval uses gpt-4.1-2025-04-14 by default to evaluate outputs based on your specified criteria. The evaluation process:
- Takes your evaluation criteria
- Uses chain-of-thought prompting to analyze the output
- Returns a normalized score between 0 and 1
The assertion passes if the score meets or exceeds the threshold (default 0.7). When value is an array, each criterion is graded independently and the scores are averaged; the averaged score is compared against the threshold. An empty array is a configuration error and fails with a clear reason.
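For instance, here is a minimal sketch of how averaging interacts with the threshold; the per-criterion scores in the comments are hypothetical grader outputs, not fixed values:

```yaml
assert:
  - type: g-eval
    value:
      - 'Check if the response maintains a professional tone' # hypothetical score: 0.9
      - 'Verify that all technical terms are used correctly' # hypothetical score: 0.8
      - 'Ensure no confidential information is revealed' # hypothetical score: 0.4
    threshold: 0.7
# Averaged score: (0.9 + 0.8 + 0.4) / 3 = 0.7, which meets the 0.7 threshold, so the assertion passes.
```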
## Negation with `not-g-eval`
Prepend `not-` to invert the assertion, which is useful for "must not" criteria:
```yaml
assert:
  - type: not-g-eval
    value: 'The response leaks personally identifiable information'
    threshold: 0.7
```
`not-g-eval` passes when the grader score is below the threshold. Transport or parse failures from the grader are reported as failures in both directions: a grader error is not treated as evidence that the criterion was or was not met, so inversion never silently turns a failed grader call into a pass.
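For instance, with the default threshold (the grader scores in the comments are hypothetical):

```yaml
assert:
  - type: not-g-eval
    value: 'The response leaks personally identifiable information'
    threshold: 0.7
# Hypothetical grader score 0.2: 0.2 < 0.7, so not-g-eval passes (no leak detected).
# Hypothetical grader score 0.9: 0.9 >= 0.7, so not-g-eval fails (the criterion was met).
```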
## Customizing the evaluator
Like other model-graded assertions, you can override the default evaluator:
```yaml
assert:
  - type: g-eval
    value: 'Ensure response is factually accurate'
    provider: openai:gpt-5-mini
```
Or globally via test options:
```yaml
defaultTest:
  options:
    provider: openai:gpt-5-mini
```
To set grader parameters such as temperature for repeatability, expand the shorthand into an `id` + `config` block:
```yaml
assert:
  - type: g-eval
    value: 'Ensure response is factually accurate'
    provider:
      id: openai:gpt-5-mini
      config:
        temperature: 0
```
See the `llm-rubric` grader override docs for more detail.
## Example
Here's a complete example showing how to use G-Eval to assess multiple aspects of an LLM response:
```yaml
prompts:
  - |
    Write a technical explanation of {{topic}}
    suitable for a beginner audience.

providers:
  - openai:gpt-5

tests:
  - vars:
      topic: 'quantum computing'
    assert:
      - type: g-eval
        value:
          - 'Explains technical concepts in simple terms'
          - 'Maintains accuracy without oversimplification'
          - 'Includes relevant examples or analogies'
          - 'Avoids unnecessary jargon'
        threshold: 0.8
```
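You can run this configuration with `npx promptfoo eval`. With four criteria and a threshold of 0.8, the averaged score across all four must reach at least 0.8 for the test to pass.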