# Guides
Use these practical tutorials to help you get the most out of promptfoo. Each guide walks through a real-world scenario with working configuration examples you can adapt to your own use case.
## Evaluation Techniques
Learn how to measure specific aspects of LLM output quality.
- Evaluating RAG Pipelines — Test retrieval-augmented generation end-to-end
- Preventing Hallucinations — Measure and reduce hallucination rates with perplexity, RAG, and controlled decoding
- Evaluating JSON Outputs — Validate structured LLM outputs against schemas
- Evaluating Coding Agents — Test AI coding assistants
- Evaluating Factuality — Score responses for factual accuracy
- Testing LLM Chains — Evaluate multi-step LLM pipelines
- Choosing the Right Temperature — Find the optimal temperature setting for your use case
- Evaluating Text-to-SQL — Measure SQL generation accuracy
- Sandboxed Code Evaluations — Safely run and test LLM-generated code
- HLE Benchmark — Evaluate models on the Humanity's Last Exam benchmark
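Most of these guides build on the same basic structure: prompts, providers, and test cases with assertions. The sketch below shows that shape; the prompt, model ID, and test data are illustrative placeholders rather than values from any specific guide.

```yaml
# promptfooconfig.yaml - minimal eval sketch (model ID and test data are placeholders)
prompts:
  - "Answer the question in one sentence: {{question}}"

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      question: "What year did Apollo 11 land on the Moon?"
    assert:
      # Simple string check on the raw output
      - type: contains
        value: "1969"
      # Model-graded check on answer quality
      - type: llm-rubric
        value: "The answer is a single, factually correct sentence."
```

Running `npx promptfoo@latest eval` executes every test case against every provider; the guides above layer RAG context, schema checks, factuality scoring, and other assertions onto this skeleton.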
## Red Teaming & Security
Find vulnerabilities in LLM applications before deployment.
- How to Red Team LLM Applications — End-to-end guide to adversarial testing
- Testing Guardrails — Evaluate content safety guardrails
- Multi-Modal Red Teaming — Test vision and other multi-modal AI systems
- Evaluating Safety with HarmBench — Benchmark LLM safety using HarmBench
- Red Teaming a Chatbase Chatbot — Security test a production chatbot
- Testing Google Cloud Model Armor — Assess Google Cloud's AI safety layer
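Red team runs are driven by the same config file, with a `redteam` section describing the target and the attacks to generate. The sketch below is a rough outline; the target, purpose, and the particular plugin and strategy selections are placeholders you would adapt to your application.

```yaml
# promptfooconfig.yaml - red team sketch (target and purpose are placeholders)
targets:
  - id: openai:gpt-4o-mini
    label: support-bot

redteam:
  purpose: "Customer support assistant for an online retailer"
  numTests: 5
  plugins:
    - harmful          # harmful-content probes
    - pii              # attempts to extract personal data
  strategies:
    - jailbreak        # iterative jailbreak attempts
    - prompt-injection # adversarial instruction injection
```

`npx promptfoo@latest redteam run` generates the adversarial test cases and runs them against the target; the guides above cover interpreting the findings and testing guardrail layers.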
## Model Comparisons
Compare LLM performance on your own data to make informed model selection decisions.
- GPT vs Claude vs Gemini — Benchmark the leading commercial models side-by-side
- DeepSeek Benchmark — Evaluate DeepSeek against leading models
- Choosing the Best GPT Model — Pick the right GPT variant for your workload
- GPT-5.2 vs o3 — Compare standard vs reasoning models
- Comparing Open-Source Models — Evaluate Mistral, Llama, Gemma, and Phi on custom datasets
- GPT-5 vs GPT-5-mini MMLU — Compare GPT-5 and GPT-5-mini on the MMLU benchmark
- Qwen vs Llama vs GPT — Run a custom benchmark across model families
- OpenAI vs Azure Benchmark — Test whether Azure-hosted models match the same models served directly by OpenAI
- Mixtral vs GPT — Pit Mixtral against GPT on real tasks
- Censored vs Uncensored Models — Compare content filtering behavior with Ollama
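Comparisons need no special setup: list several providers and every prompt and test case runs against each one, producing a side-by-side matrix. A rough sketch follows; the model IDs are examples and may not reflect the latest releases.

```yaml
# promptfooconfig.yaml - side-by-side comparison sketch (model IDs are illustrative)
prompts:
  - "Summarize the following support ticket in two sentences: {{ticket}}"

providers:
  - openai:gpt-4o
  - anthropic:messages:claude-3-5-sonnet-20241022
  - vertex:gemini-1.5-pro

tests:
  - vars:
      ticket: "My order arrived damaged and I would like a replacement."
    assert:
      - type: llm-rubric
        value: "The summary mentions the damaged order and the request for a replacement."
```

The same pattern scales to full benchmarks such as MMLU by swapping in a dataset of test cases.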
## Integrations
Use promptfoo with popular frameworks and services.
- Using LangChain PromptTemplate — Integrate LangChain prompt templates with promptfoo
- Evaluating OpenAI Assistants — Evaluate OpenAI's Assistants API
- Evaluating CrewAI Agents — Red team and evaluate CrewAI multi-agent workflows
- Evaluating LangGraph — Test LangGraph agent applications
- Evaluating ElevenLabs Voice AI — Test text-to-speech and voice AI outputs
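Most integrations plug in as providers in the same config: hosted services can be referenced by ID, and frameworks can be wrapped in a custom Python provider. The sketch below assumes placeholder values for the assistant ID and file path.

```yaml
# promptfooconfig.yaml - integration sketch (assistant ID and file path are placeholders)
providers:
  # Target an existing OpenAI Assistant by its ID
  - openai:assistant:asst_REPLACE_WITH_YOUR_ASSISTANT_ID
  # Wrap a framework (LangChain, LangGraph, CrewAI, ...) in a custom Python provider
  - file://my_langchain_provider.py
```

The Python file would expose a `call_api(prompt, options, context)` function that invokes your chain or agent and returns its output; the integration guides above walk through the details for each framework.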