Evaluate OpenAI Agents (Python SDK)

Use the Python openai-agents SDK with Promptfoo by wrapping your agent as a Python provider. This gives you full control over agent code, tools, sessions, and framework-specific tracing, while still letting Promptfoo score outputs and assert on the traced workflow.

note

The built-in openai:agents:* provider is for the JavaScript @openai/agents SDK. For the Python SDK, use the Python provider path described here.

Quick Start

npx promptfoo@latest init --example openai-agents
cd openai-agents

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

export OPENAI_API_KEY=your_api_key_here

# Run the eval
npx promptfoo@latest eval -c promptfooconfig.yaml --no-cache

# Optional: also emit a provider-level Python OpenTelemetry span
PROMPTFOO_ENABLE_OTEL=true npx promptfoo@latest eval -c promptfooconfig.yaml --no-cache
npx promptfoo@latest view

What The Example Covers

  • multi-turn execution over a persistent SQLiteSession
  • SDK 0.14 SandboxAgent execution over a staged Unix-local Python workspace
  • local-shell skill mounting with ShellTool(environment={"type": "local", "skills": [...]})
  • specialist handoffs between a triage agent, an FAQ agent, and a seat-booking agent
  • Promptfoo trace ingestion of the SDK's internal spans
  • assertions on tool usage, tool arguments, sandbox commands, agent spans, tool order, and overall task success

How The Tracing Works

Promptfoo can only assert on tool paths if it receives the agent's internal spans. The example does that by installing a custom TracingProcessor for the OpenAI Agents SDK and exporting those spans to Promptfoo's OTLP receiver.

At a high level:

  1. Promptfoo enables tracing and injects a W3C traceparent into the Python provider context.
  2. The example parses that trace context and configures a custom OpenAI Agents tracing processor.
  3. The processor converts OpenAI Agents spans into OTLP JSON.
  4. Promptfoo ingests those spans and makes them available in the Trace Timeline and trajectory:* assertions.

If you skip this exporter, Promptfoo will not see the SDK's tool and handoff spans, so trajectory:* assertions will not have the trace data they need.
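
A minimal sketch of that bridge is below. The OTLP endpoint, the traceparent key on the provider context, and the to_otlp_span() helper are assumptions for illustration; the bundled example's processor handles re-parenting, span attributes, and batching more completely.

import json
import urllib.request

from agents import add_trace_processor
from agents.tracing import TracingProcessor

OTLP_ENDPOINT = "http://localhost:4318/v1/traces"  # assumed Promptfoo OTLP receiver

class PromptfooSpanExporter(TracingProcessor):
    def __init__(self, traceparent):
        # W3C traceparent format: version-traceid-spanid-flags
        self.trace_id = traceparent.split("-")[1] if traceparent else None

    def on_trace_start(self, trace): ...
    def on_trace_end(self, trace): ...
    def on_span_start(self, span): ...

    def on_span_end(self, span):
        # to_otlp_span() is a hypothetical helper that maps the SDK span onto
        # OTLP JSON and re-parents it under self.trace_id.
        payload = {"resourceSpans": [to_otlp_span(span, self.trace_id)]}
        request = urllib.request.Request(
            OTLP_ENDPOINT,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request)

    def shutdown(self): ...
    def force_flush(self): ...

def call_api(prompt, options, context):
    # Promptfoo injects a W3C traceparent when tracing is enabled; the exact
    # context key is an assumption here. Install the processor once per process.
    add_trace_processor(PromptfooSpanExporter(context.get("traceparent")))
    # ...run the agent and return {"output": ...}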

If you also enable Promptfoo's Python OpenTelemetry wrapper instrumentation with PROMPTFOO_ENABLE_OTEL=true, the example will emit a provider-level Python span as well. The custom SDK spans will inherit that active OpenTelemetry span as their parent. The example config accepts both OTLP JSON and OTLP/protobuf because the SDK bridge emits JSON while the wrapper exporter uses protobuf by default.
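
For reference, enabling Promptfoo's OTLP receiver in the config looks roughly like the sketch below; treat the exact keys as an assumption and copy from the bundled promptfooconfig.yaml.

tracing:
  enabled: true
  otlp:
    http:
      enabled: true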

SDK 0.14 adds custom spans for sandbox lifecycle work, and the SandboxAgent's shell tool emits exec_command function-tool spans. The example bridge maps SDK custom spans into normal OTLP attributes such as sandbox.operation, command, and process.exit.code, while Promptfoo normalizes OpenAI Agents exec_command tool spans as command trajectory steps. The same mapping also exposes command spans emitted by the SDK's experimental Codex tool as command and codex.command.

Assertion Pattern

The example config asserts on the agent's actual behavior instead of only the final message:

vars:
  steps_json: |
    [
      "My name is Ada Lovelace and my confirmation number is ABC123.",
      "Move me to seat 14C.",
      "Also, what is the baggage allowance?"
    ]

assert:
  - type: trajectory:tool-used
    value:
      - lookup_reservation
      - update_seat
      - faq_lookup

  - type: trajectory:tool-args-match
    value:
      name: update_seat
      args:
        confirmation_number: ABC123
        new_seat: 14C
      mode: partial

  - type: trajectory:tool-sequence
    value:
      steps:
        - lookup_reservation
        - update_seat
        - faq_lookup

  - type: trajectory:step-count
    value:
      type: span
      pattern: 'agent *'
      min: 3

  - type: trace-error-spans
    value:
      max_count: 0
Use trajectory:goal-success when you want a judge model to decide whether the traced workflow actually completed the task, not just whether it hit the right tool path.
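
A sketch of that assertion, with an illustrative rubric (the exact value schema may differ):

assert:
  - type: trajectory:goal-success
    value: >-
      The passenger was moved to seat 14C and the baggage-allowance
      question was answered accurately.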

Long-Horizon Tasks

The example turns one eval row into a long-horizon task by passing a JSON-encoded list of user turns in vars.steps_json. The provider parses that JSON and executes the turns sequentially against a shared SQLiteSession, which lets the SDK preserve working memory across turns inside a single Promptfoo test case.

The example also returns tokenUsage.numRequests, cached-input tokens, and reasoning-token detail from the SDK's raw model responses. That preserves the real multi-call footprint of handoffs and tool/model loops instead of collapsing every eval row to one request.
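
A trimmed sketch of that provider loop, assuming the agent definition lives elsewhere in the example and reporting only request counts rather than the full usage detail:

import json

from agents import Agent, Runner, SQLiteSession

agent = Agent(
    name="Airline assistant",
    instructions="Help with seat changes and FAQ questions.",
)

def call_api(prompt, options, context):
    turns = json.loads(context["vars"]["steps_json"])
    session = SQLiteSession("eval-row", "conversations.db")  # shared working memory

    final_output = ""
    num_requests = 0
    for turn in turns:
        result = Runner.run_sync(agent, turn, session=session)
        final_output = result.final_output
        num_requests += len(result.raw_responses)  # one entry per model call

    return {
        "output": final_output,
        "tokenUsage": {"numRequests": num_requests},
    }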

That pattern is useful when you want to evaluate:

  • multi-step workflows that need memory
  • agent handoffs over time
  • task completion after several intermediate actions
  • regressions in tool usage across longer trajectories

Promptfoo does not infer a dollar cost for this path automatically. A Python provider can mix models, hosted tools, and custom backends inside one agent graph, while the SDK's aggregate usage objects do not identify the priced model for each request. Return cost from your provider only when you can account for every billed model and hosted tool used by the run.
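
If you do take on that accounting, the provider response just carries an extra cost field; a sketch, where price_run() is a hypothetical helper that prices every raw model response and hosted tool call:

return {
    "output": final_output,
    "tokenUsage": {"numRequests": num_requests},
    "cost": price_run(result.raw_responses),  # hypothetical pricing helper
}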

Sandbox Agents

OpenAI Agents SDK 0.14 introduced SandboxAgent, Manifest, and SandboxRunConfig for agents that need a live filesystem. Promptfoo does not need a special provider for this path: keep using a Python provider and pass a sandbox run config to the SDK.

The bundled example follows the same shape as the SDK's official sandbox coding examples: stage a small repo with a task file, source file, tests, and maintainer instructions; force the agent to inspect the workspace through shell commands; then assert on both the answer and the trace.

from agents import ModelSettings, Runner
from agents.run import RunConfig
from agents.sandbox import Manifest, SandboxAgent, SandboxRunConfig
from agents.sandbox.entries import File
from agents.sandbox.sandboxes.unix_local import UnixLocalSandboxClient

agent = SandboxAgent(
    name="Workspace analyst",
    model="gpt-5.4-mini",
    instructions="Inspect the workspace with shell before answering.",
    default_manifest=Manifest(
        entries={
            "repo/task.md": File(content=b"Find the high-severity issue."),
        }
    ),
    model_settings=ModelSettings(include_usage=True),
)

result = Runner.run_sync(
    agent,
    "Inspect the staged repo and summarize the issue.",
    run_config=RunConfig(
        sandbox=SandboxRunConfig(client=UnixLocalSandboxClient()),
    ),
)

The bundled example includes a sandbox-workflow provider label and a sandbox test that asserts the agent reported the staged ticket, ran the requested unittest command, and emitted the expected sandbox trace shape:

assert:
  - type: trace-span-count
    value:
      pattern: tool exec_command
      min: 2
  - type: trace-span-count
    value:
      pattern: sandbox.start
      min: 1
  - type: trace-span-count
    value:
      pattern: 'response *'
      min: 2
  - type: trajectory:step-count
    value:
      type: command
      pattern: '*unittest*'
      min: 1

Use UnixLocalSandboxClient for local development, DockerSandboxClient when you need container isolation, and hosted sandbox clients when your application already depends on managed execution. Keep credentials and secrets out of staged Manifest files unless the sandbox backend and trace redaction policy are appropriate for that data.

Skills

The Python SDK exposes Agent Skills through shell environments rather than through Codex-style ambient discovery. For a local, reproducible eval, mount the skill on ShellTool explicitly. The bundled example also defines a small SkillShellExecutor that runs those local shell commands:

from pathlib import Path

from agents import Agent, ShellTool

discount_review_skill = {
    "name": "discount-review",
    "description": "Inspect the discount policy fixture with the bundled checklist.",
    "path": "/path/to/skills/discount-review",
}

agent = Agent(
    name="Local Skill Analyst",
    instructions="Use the discount-review skill for discount-policy review tasks.",
    tools=[
        ShellTool(
            environment={
                "type": "local",
                "skills": [discount_review_skill],
            },
            # SkillShellExecutor is defined in the bundled example
            executor=SkillShellExecutor(cwd=Path(__file__).parent),
        )
    ],
)

The bundled skill-workflow example keeps the task small on purpose: it mounts a discount-review skill, asks the agent to inspect a local fixture repo, and has the skill run a helper script before answering.

assert:
  - type: trajectory:step-count
    value:
      type: command
      pattern: '*discount-review/SKILL.md*'
      min: 1
  - type: trajectory:step-count
    value:
      type: command
      pattern: '*analyze_discount_policy.py*'
      min: 1
  - type: contains
    value: return discount_percent >= 20
  - type: not-contains
    value: 'stderr:'

Today, the Python SDK does not expose a first-class skill invocation event that Promptfoo can normalize into skill-used. For Python SDK skill evals, assert on the observable workflow instead: the skill file was read, the helper command ran cleanly, and the final answer reflects the skill's result. If your application already tracks selected skills, you can also return metadata.skillCalls from the Python provider yourself and use Promptfoo's skill-used assertion on top of that.
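
A sketch of that fallback, assuming your provider already knows which skills ran; the shape of each skillCalls entry is an assumption:

return {
    "output": final_output,
    "metadata": {
        "skillCalls": [{"name": "discount-review"}],  # assumed entry shape
    },
}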

Hosted shell follows the same eval idea, but the attachment shape changes from a local path to a hosted skill_reference. Keep local shell for examples you want users to run from a fresh clone; use hosted shell when your product already depends on uploaded, versioned skills.

Experimental Codex Tool

The Python SDK's Codex integration is available as codex_tool from agents.extensions.experimental.codex. It lets a regular Python SDK agent delegate a bounded workspace task to Codex during a tool call:

from agents import Agent
from agents.extensions.experimental.codex import ThreadOptions, TurnOptions, codex_tool

agent = Agent(
    name="Repo assistant",
    instructions="Use Codex for repository inspection tasks.",
    tools=[
        codex_tool(
            sandbox_mode="workspace-write",
            working_directory="/path/to/repo",
            default_thread_options=ThreadOptions(
                model="gpt-5.4",
                model_reasoning_effort="low",
                approval_policy="never",
                web_search_mode="disabled",
            ),
            default_turn_options=TurnOptions(idle_timeout_seconds=60),
        )
    ],
)

Evaluate that agent through the same Python provider pattern. The example tracing bridge exposes Codex command execution spans as command and codex.command, so Promptfoo's trajectory assertions can verify that Codex actually inspected files or ran commands.
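
A hedged sketch of such a check, asserting only that at least one command trajectory step was recorded during the Codex delegation:

assert:
  - type: trajectory:step-count
    value:
      type: command
      pattern: '*'
      min: 1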

If Codex itself is the system under test, prefer Promptfoo's dedicated openai:codex-sdk or openai:codex-app-server providers. The app-server provider supports approvals_reviewer: guardian_subagent; the Python openai-agents SDK 0.14.1 package does not expose a public Guardian/guardian API.

Red Team The Agent

The example includes two red-team configs. promptfooconfig.redteam.yaml targets the Python SDK airline agent with trace capture enabled. promptfooconfig.redteam.coding.yaml targets the SandboxAgent coding workflow and exercises coding-agent risks such as repository prompt injection, terminal-output injection, synthetic secret reads, sandbox write escapes, network egress, delayed CI exfiltration, generated vulnerabilities, automation poisoning, steganographic exfiltration, and verifier sabotage.

npx promptfoo@latest redteam generate -c promptfooconfig.redteam.yaml -o redteam.generated.yaml --remote --force --strict
npx promptfoo@latest redteam eval -c redteam.generated.yaml --no-cache --no-share -j 1 -o redteam-results.json

npx promptfoo@latest redteam generate -c promptfooconfig.redteam.coding.yaml -o redteam.coding.generated.yaml --remote --force --strict
npx promptfoo@latest redteam eval -c redteam.coding.generated.yaml --no-cache --no-share -j 1 -o redteam-coding-results.json

Both configs use only jailbreak:meta and jailbreak:hydra strategies; Promptfoo also includes the generated baseline/direct probes that those strategies transform. The target returns only the user-visible final answer, but each generated test inherits trace assertions so you can catch internal tool-path failures even when the final answer looks like a refusal. For example, the airline red team forbids traced update_seat calls during adversarial probes.
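
The relevant strategies block in those configs looks roughly like this sketch:

redteam:
  strategies:
    - jailbreak:meta
    - jailbreak:hydra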

Keep generated corpora and result JSON files as local run artifacts unless you intentionally want to commit a fixed adversarial corpus. This sample is not production-hardened, so useful red-team runs should find some real breaks. Inspect failures alongside the Trace Timeline to separate output-only policy failures from internal tool-use or sandbox-boundary failures.

Multimodal Input

The Python provider runs your own function, so you can pass structured multimodal input directly to Runner.run_sync() instead of a plain string:

result = Runner.run_sync(
    agent,
    [
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What is in this image?"},
                {"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_b64}"},
            ],
        }
    ],
)

Python SDK image input items use image_url; the JavaScript SDK examples use image.

Telemetry

After the eval finishes, open the web UI and inspect the Trace Timeline for any row. You should see:

  • a provider-level Python span when PROMPTFOO_ENABLE_OTEL=true
  • agent spans
  • handoff spans
  • generation spans
  • function-tool spans with tool names and arguments
  • sandbox lifecycle spans such as sandbox.start and sandbox.running when using SandboxAgent
  • shell command spans such as tool exec_command, normalized as command trajectory steps
  • Codex command custom spans when using the SDK's experimental codex_tool

That same trace data powers trace-span-* and trajectory:* assertions.