GPT vs Claude vs Gemini: Benchmark on Your Own Data

When evaluating the performance of LLMs, generic benchmarks will only get you so far. Model capabilities set a ceiling on what you can accomplish, but in our experience how well an LLM app performs depends heavily on its prompts and use case.

So, the sensible thing to do is run an eval on your own data.

This guide walks you through setting up a comparison between OpenAI's GPT-5.2, Anthropic's Claude Sonnet 4.6, and Google's Gemini using promptfoo. The end result is a side-by-side evaluation of how these models perform on custom tasks.

Prerequisites

Before getting started, make sure you have:

- The promptfoo CLI installed (installation instructions)
- API keys for the providers you want to test:
  - OPENAI_API_KEY for OpenAI (configuration)
  - ANTHROPIC_API_KEY for Anthropic (configuration)
  - GOOGLE_API_KEY for Google AI (configuration)

Step 1: Set Up Your Evaluation

Create a new directory for your comparison project:

npx promptfoo@latest init --example gpt-vs-claude-vs-gemini
cd gpt-vs-claude-vs-gemini

Open the promptfooconfig.yaml file. This is where you'll configure the models to test, the prompts to use, and the test cases to run.

Configure the Models

Specify the models you want to compare under providers:

providers:
  - openai:gpt-5.2
  - anthropic:claude-sonnet-4-6
  - google:gemini-3.1-pro-preview

You can optionally set parameters like temperature and max tokens for each model:

providers:
  - id: openai:gpt-5.2
    config:
      max_tokens: 1024
  - id: anthropic:claude-sonnet-4-6
    config:
      temperature: 0.3
      max_tokens: 1024
  - id: google:gemini-3.1-pro-preview
    config:
      temperature: 0.3
      maxOutputTokens: 1024

You don't have to compare all three at once. If you only want to compare GPT vs Claude, or GPT vs Gemini, just remove the provider you don't need from the list. Any combination of two or more models works.

Define Your Prompts

Next, define the prompt(s) you want to test the models on. For this example, we'll just use a simple prompt:

prompts:
  - 'Answer this riddle: {{riddle}}'

If desired, you can use a prompt template defined in a separate prompt.yaml or prompt.json file. This makes it easier to set things like the system message:

prompts:
  - file://prompt.yaml

The contents of prompt.yaml:

- role: system
  content: 'You are a careful riddle solver. Be concise.'
- role: user
  content: |
    Answer this riddle:
    {{riddle}}

The {{riddle}} placeholder will be populated by test case variables.
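To make the substitution concrete, here is a minimal JavaScript sketch of filling a {{riddle}}-style placeholder. promptfoo renders prompts with the Nunjucks templating engine, which supports far more than this; the renderPrompt helper below is a hypothetical stand-in for illustration, not promptfoo's implementation.

```javascript
// Hypothetical helper: replaces {{name}} placeholders with values from vars.
// It only mimics simple variable substitution, nothing else Nunjucks offers.
function renderPrompt(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    // Leave unknown placeholders untouched so missing vars are easy to spot.
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

const prompt = renderPrompt('Answer this riddle: {{riddle}}', {
  riddle: 'The more of this there is, the less you see. What is it?',
});
console.log(prompt);
```

Each test case supplies its own vars, so one prompt template yields one rendered prompt per test case.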

You can also assign specific prompts to each model, in case a prompt needs tuning for a particular model:

prompts:
  prompts/gpt_prompt.json: gpt_prompt
  prompts/gemini_prompt.json: gemini_prompt

providers:
  - id: google:gemini-3.1-pro-preview
    prompts:
      - gemini_prompt
  - id: openai:gpt-5.2
    prompts:
      - gpt_prompt
  - id: anthropic:claude-sonnet-4-6
    prompts:
      - gpt_prompt

Step 2: Create Test Cases

Now it's time to create a set of test cases that represent the types of queries your application needs to handle.

The key is to focus your analysis on the cases that matter most for your application. Think about the edge cases and specific competencies that you need in an LLM.

In this example, we'll use a few riddles to test the models' reasoning and language understanding capabilities:

tests:
  - vars:
      riddle: 'I speak without a mouth and hear without ears. I have no body, but I come alive with wind. What am I?'
    assert:
      - type: icontains
        value: echo
      - type: llm-rubric
        value: Does not apologize
  - vars:
      riddle: "You see a boat filled with people. It has not sunk, but when you look again you don't see a single person on the boat. Why?"
    assert:
      - type: llm-rubric
        value: explains that the people are below deck or they are all in a relationship
  - vars:
      riddle: 'The more of this there is, the less you see. What is it?'
    assert:
      - type: icontains
        value: darkness
  # ... more test cases

The assert blocks allow you to automatically check the model outputs for expected content. This is useful for tracking performance over time as you refine your prompts.
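As a rough illustration of what a check like icontains looks for, the sketch below re-implements a case-insensitive substring test in JavaScript. The function is a hypothetical stand-in written for this guide, not promptfoo's own code.

```javascript
// Hypothetical re-implementation of a case-insensitive substring check,
// similar in spirit to the icontains assertion used above.
function icontains(output, expected) {
  return output.toLowerCase().includes(String(expected).toLowerCase());
}

console.log(icontains('The answer is an Echo.', 'echo')); // true
console.log(icontains('It is the wind.', 'echo')); // false
```

Substring checks are cheap and deterministic, which is why they pair well with a model-graded assertion such as llm-rubric for the qualities a string match cannot capture.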

Tip: promptfoo supports a wide variety of assertions, ranging from basic checks to model-graded evaluations to assertions specialized for RAG applications. See the assertions documentation to learn more.

Step 3: Run the Evaluation

With your configuration complete, you can kick off the evaluation:

npx promptfoo@latest eval

This will run each test case against all configured models and record the results.
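To get a feel for how many outputs an eval produces, the following sketch enumerates every prompt, provider, and test case combination. It is illustrative only: enumerateRuns is a hypothetical helper, and it does not model promptfoo's ordering, caching, or concurrency. Model-graded assertions such as llm-rubric also make additional calls to a grading model, which are not counted here.

```javascript
// Illustrative only: lists the prompt/provider/test combinations an eval covers.
function enumerateRuns(prompts, providers, tests) {
  const runs = [];
  for (const prompt of prompts) {
    for (const provider of providers) {
      for (const test of tests) {
        runs.push({ prompt, provider, vars: test.vars });
      }
    }
  }
  return runs;
}

const runs = enumerateRuns(
  ['Answer this riddle: {{riddle}}'],
  ['openai:gpt-5.2', 'anthropic:claude-sonnet-4-6', 'google:gemini-3.1-pro-preview'],
  [
    { vars: { riddle: 'riddle 1' } },
    { vars: { riddle: 'riddle 2' } },
    { vars: { riddle: 'riddle 3' } },
  ]
);
console.log(runs.length); // 9
```

With one prompt, three providers, and three test cases, the eval produces nine outputs, so cost and run time grow quickly as you add test cases or prompt variants.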

To view the results, start up the promptfoo viewer:

npx promptfoo@latest view

This will display a comparison view showing how each model performed on each test case.

You can also output the raw results data to a file:

npx promptfoo@latest eval -o results.json

Step 4: Analyze the Results

With the evaluation complete, it's time to dig into the results and see how the models compared on your test cases.

Some key things to look for:

- Which model had the highest overall pass rate on the test assertions? In this case, all three models got the riddles correct, which is great, since these riddles often trip up less powerful models.
- Were there specific test cases where one model significantly outperformed the others?
- How did the models compare on other output quality metrics?
- How do the models compare on properties like speed and cost, in addition to quality?
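If you export the results to a file, a small script can compute pass rate and average latency per model. The sketch below assumes you have already flattened your results into one row per test case and provider; the field names provider, success, and latencyMs are assumptions made for this example, so check them against your own results file. The rows shown are made-up sample data, not output from this eval.

```javascript
// Assumed input: one row per (test case, provider) with hypothetical fields
// provider, success, and latencyMs. Map your own results onto this shape first.
function summarizeByProvider(rows) {
  const totals = {};
  for (const { provider, success, latencyMs } of rows) {
    if (!totals[provider]) {
      totals[provider] = { total: 0, passed: 0, latencySum: 0 };
    }
    totals[provider].total += 1;
    if (success) totals[provider].passed += 1;
    totals[provider].latencySum += latencyMs;
  }
  const summary = {};
  for (const [provider, t] of Object.entries(totals)) {
    summary[provider] = {
      passRate: t.passed / t.total,
      avgLatencyMs: t.latencySum / t.total,
    };
  }
  return summary;
}

// Made-up sample rows for illustration.
const sampleRows = [
  { provider: 'model-a', success: true, latencyMs: 800 },
  { provider: 'model-a', success: false, latencyMs: 1200 },
  { provider: 'model-b', success: true, latencyMs: 3000 },
];
console.log(summarizeByProvider(sampleRows));
```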

Here are a few observations from our example riddle test set:

- GPT's responses tended to be short and direct, while Claude often included extra commentary
- Gemini's responses were the most terse
- GPT was the fastest, while Gemini's reasoning overhead made it the slowest

Adding assertions for things we care about

Based on the above observations, let's add the following assertions to all tests in this eval using defaultTest:

- Latency must be under 5000 ms
- A sliding-scale JavaScript function that penalizes long responses

defaultTest:
  assert:
    # Inference should always be faster than this (milliseconds)
    - type: latency
      threshold: 5000
    # Penalize long responses on a sliding scale
    - type: javascript
      value: 'output.length <= 100 ? 1 : output.length > 1000 ? 0 : 1 - (output.length - 100) / 900'
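To see how the sliding scale behaves, here is the same formula written as a standalone JavaScript function and evaluated at a few lengths. The name lengthScore is ours, chosen for this sketch; the arithmetic mirrors the inline expression above.

```javascript
// Same formula as the inline assertion: 1 at or below 100 characters,
// 0 above 1000, and a linear ramp in between.
function lengthScore(output) {
  if (output.length <= 100) return 1;
  if (output.length > 1000) return 0;
  return 1 - (output.length - 100) / 900;
}

for (const n of [50, 100, 550, 1000, 2000]) {
  console.log(n, lengthScore('x'.repeat(n)));
}
```

A 550-character response sits halfway along the ramp and scores 0.5, while anything at 1000 characters or more scores 0.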

The result is that Gemini sometimes fails our latency requirements.

Clicking into a specific test case shows the individual test results.

The right tradeoff between latency and accuracy depends on the application. That's why it's important to run your own eval.

Of course, our requirements are different from yours. You should customize these values to suit your use case.

Testing Logic and Reasoning

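Riddles are fun, but you can also test models on logic and reasoning tasks. Here are some examples from a Hacker News thread. Note that these tests use a question variable, so the prompt must reference {{question}} rather than {{riddle}}: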

tests:
  - vars:
      question: There are 31 books in my house. I read 2 books over the weekend. How many books are still in my house?
    assert:
      - type: contains
        value: 31
  - vars:
      question: Julia has three brothers, each of them has two sisters. How many sisters does Julia have?
    assert:
      - type: icontains-any
        value:
          - 1
          - one
  - vars:
      question: If you place an orange below a plate in the living room, and then move the plate to the kitchen, where is the orange now?
    assert:
      - type: contains
        value: living room

For more complex validations, you can use models to grade outputs, custom JavaScript or Python functions, or even external webhooks. Have a look at all the assertion types.
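As an example of where a custom function helps, a plain substring check for "1" would also match an answer like "31". The sketch below is a hypothetical JavaScript check for the Julia question that only accepts "1" or "one" as a standalone token; the function name and the idea of wiring it into a javascript assertion are assumptions for illustration, so consult the assertion docs for the exact function shape promptfoo expects.

```javascript
// Hypothetical custom check for the "Julia" question: passes only when the
// output contains "1" or "one" as a standalone token (case-insensitive).
function saysOne(output) {
  return /\b(1|one)\b/i.test(output);
}

console.log(saysOne('Julia has one sister.')); // true
console.log(saysOne('Julia has 31 sisters.')); // false
console.log(saysOne('None of them.')); // false
```

The word boundaries keep "31" and "None" from counting as a correct answer, which a substring match would let through.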

You can use llm-rubric to run free-form assertions. For example, here we use the assertion to detect a hallucination about the weather:

- vars:
    question: What's the weather in New York?
  assert:
    - type: llm-rubric
      value: Does not claim to know the weather in New York

Testing Vision and Multimodal

If you're working on an application that involves classifying images, you can set up a comparison using promptfoo. Here's an example of a binary image classification task:

promptfooconfig.yaml
providers:
  - openai:gpt-5.2
  - anthropic:claude-sonnet-4-6
  - google:gemini-3.1-pro-preview

prompts:
  - |
    role: user
    content:
      - type: text
        text: Please classify this image as a cat or a dog in one word in lower case.
      - type: image_url
        image_url:
          url: "{{url}}"
tests:
  - vars:
      url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Felis_catus-cat_on_snow.jpg/640px-Felis_catus-cat_on_snow.jpg'
    assert:
      - type: equals
        value: 'cat'
  - vars:
      url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/American_Eskimo_Dog.jpg/612px-American_Eskimo_Dog.jpg'
    assert:
      - type: equals
        value: 'dog'

Run the comparison with the promptfoo eval command to see how each model performs on your image classification task. While larger models may provide higher accuracy, smaller models' lower cost makes them an attractive option for applications where cost-efficiency is crucial.
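One practical wrinkle: an exact-match check is strict, so an output such as "Cat." or " cat" with stray whitespace would not equal "cat". If that turns out to be a problem in your results, a normalizing check is one option. The helpers below are hypothetical names written for this sketch, not part of promptfoo.

```javascript
// Hypothetical normalizer: trims whitespace, lowercases, and drops
// trailing punctuation so "Cat." and " cat\n" both become "cat".
function normalizeLabel(output) {
  return output.trim().toLowerCase().replace(/[.!?,;:]+$/, '');
}

// Compares the normalized output against the expected label.
function matchesLabel(output, expected) {
  return normalizeLabel(output) === expected;
}

console.log(matchesLabel('Cat.', 'cat')); // true
console.log(matchesLabel(' dog\n', 'dog')); // true
console.log(matchesLabel('It is a cat', 'cat')); // false
```

The last case still fails on purpose: a full-sentence answer ignores the one-word instruction in the prompt, and you may want that to count against the model.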

The right tradeoff between cost, latency, and accuracy depends on the application. That's why it's important to run your own evaluation.

Conclusion

By running this type of targeted evaluation, you can gain valuable insights into how these models are likely to perform on your application's real-world data and tasks.

promptfoo makes it easy to set up a repeatable evaluation pipeline so you can test models as they evolve and measure the impact of model and prompt changes.

Your results will vary with your requirements, so we encourage you to write your own test cases and choose the model that works best for you.

To learn more about promptfoo, check out the getting started guide and configuration reference.
