# Guides
Use these practical tutorials to help you get the most out of promptfoo. Each guide walks through a real-world scenario with working configuration examples you can adapt to your own use case.
## Evaluation Techniques
Learn how to measure specific aspects of LLM output quality.
- Evaluating RAG Pipelines — Test retrieval-augmented generation end-to-end
- Preventing Hallucinations — Measure and reduce hallucination rates with perplexity, RAG, and controlled decoding
- Evaluating JSON Outputs — Validate structured LLM outputs against schemas
- Evaluating Coding Agents — Test AI coding assistants
- Evaluating Factuality — Score responses for factual accuracy
- Testing LLM Chains — Evaluate multi-step LLM pipelines
- Choosing the Right Temperature — Find the optimal temperature setting for your use case
- Evaluating Text-to-SQL — Measure SQL generation accuracy
- Sandboxed Code Evaluations — Safely run and test LLM-generated code
- HLE Benchmark — Evaluate models on the Humanity's Last Exam benchmark
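Most of these guides build on the same basic structure: prompts, providers, and test cases with assertions. The sketch below shows that shape; the prompt, model ID, and test data are illustrative placeholders rather than values from any specific guide.

```yaml
# promptfooconfig.yaml - minimal eval sketch (model ID and test data are placeholders)
prompts:
  - "Answer the question in one sentence: {{question}}"

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      question: "What year did Apollo 11 land on the Moon?"
    assert:
      # Simple string check on the raw output
      - type: contains
        value: "1969"
      # Model-graded check on answer quality
      - type: llm-rubric
        value: "The answer is a single, factually correct sentence."
```

Running `npx promptfoo@latest eval` executes every test case against every provider; the guides above layer RAG context, schema checks, factuality scoring, and other assertions onto this skeleton.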
## Red Teaming & Security
Find vulnerabilities in LLM applications before deployment.
- How to Red Team LLM Applications — End-to-end guide to adversarial testing
- Testing Guardrails — Evaluate content safety guardrails
- Multi-Modal Red Teaming — Test vision and other multi-modal AI systems
- Evaluating Safety with HarmBench — Benchmark LLM safety using HarmBench
- Red Teaming a Chatbase Chatbot — Security test a production chatbot
- Testing Google Cloud Model Armor — Assess Google Cloud's AI safety layer
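Red team runs are driven by the same config file, with a `redteam` section describing the target and the attacks to generate. The sketch below is a rough outline; the target, purpose, and the particular plugin and strategy selections are placeholders you would adapt to your application.

```yaml
# promptfooconfig.yaml - red team sketch (target and purpose are placeholders)
targets:
  - id: openai:gpt-4o-mini
    label: support-bot

redteam:
  purpose: "Customer support assistant for an online retailer"
  numTests: 5
  plugins:
    - harmful          # harmful-content probes
    - pii              # attempts to extract personal data
  strategies:
    - jailbreak        # iterative jailbreak attempts
    - prompt-injection # adversarial instruction injection
```

`npx promptfoo@latest redteam run` generates the adversarial test cases and runs them against the target; the guides above cover interpreting the findings and testing guardrail layers.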
## Model Comparisons
Compare LLM performance on your own data to make informed model selection decisions.
- GPT vs Claude vs Gemini — Benchmark the leading commercial models side-by-side
- DeepSeek Benchmark — Evaluate DeepSeek against leading models
- Choosing the Best GPT Model — Pick the right GPT variant for your workload
- GPT-5.2 vs o3 — Compare standard vs reasoning models
- Comparing Open-Source Models — Evaluate Mistral, Llama, Gemma, and Phi on custom datasets
- GPT-5 vs GPT-5-mini MMLU — Compare GPT-5 and GPT-5-mini on the MMLU benchmark
- Qwen vs Llama vs GPT — Run a custom benchmark across model families
- OpenAI vs Azure Benchmark — Test whether Azure-hosted models match the same models served directly by OpenAI
- Mixtral vs GPT — Pit Mixtral against GPT on real tasks
- Censored vs Uncensored Models — Compare content filtering behavior with Ollama
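Comparisons need no special setup: list several providers and every prompt and test case runs against each one, producing a side-by-side matrix. A rough sketch follows; the model IDs are examples and may not reflect the latest releases.

```yaml
# promptfooconfig.yaml - side-by-side comparison sketch (model IDs are illustrative)
prompts:
  - "Summarize the following support ticket in two sentences: {{ticket}}"

providers:
  - openai:gpt-4o
  - anthropic:messages:claude-3-5-sonnet-20241022
  - vertex:gemini-1.5-pro

tests:
  - vars:
      ticket: "My order arrived damaged and I would like a replacement."
    assert:
      - type: llm-rubric
        value: "The summary mentions the damaged order and the request for a replacement."
```

The same pattern scales to full benchmarks such as MMLU by swapping in a dataset of test cases.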
## Integrations
Use promptfoo with popular frameworks and services.
- Using LangChain PromptTemplate — Integrate LangChain prompt templates with promptfoo
- Evaluating OpenAI Assistants — Evaluate OpenAI's Assistants API
- Evaluating CrewAI Agents — Red team and evaluate CrewAI multi-agent workflows
- Evaluating LangGraph — Test LangGraph agent applications
- Evaluating ElevenLabs Voice AI — Test text-to-speech and voice AI outputs
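Most integrations plug in as providers in the same config: hosted services can be referenced by ID, and frameworks can be wrapped in a custom Python provider. The sketch below assumes placeholder values for the assistant ID and file path.

```yaml
# promptfooconfig.yaml - integration sketch (assistant ID and file path are placeholders)
providers:
  # Target an existing OpenAI Assistant by its ID
  - openai:assistant:asst_REPLACE_WITH_YOUR_ASSISTANT_ID
  # Wrap a framework (LangChain, LangGraph, CrewAI, ...) in a custom Python provider
  - file://my_langchain_provider.py
```

The Python file would expose a `call_api(prompt, options, context)` function that invokes your chain or agent and returns its output; the integration guides above walk through the details for each framework.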