Why Attack Success Rate (ASR) Isn't Comparable Across Jailbreak Papers Without a Shared Threat Model
If you've read papers about jailbreak attacks on language models, you've encountered Attack Success Rate, or ASR: the fraction of attack attempts that get a model to produce prohibited content, and the headline metric for comparing different methods. Higher ASR means a more effective attack, or so the reasoning goes.
In practice, ASR numbers from different papers often can't be compared directly because the metric isn't standardized. Different research groups make different choices about what counts as an "attempt," what counts as "success," and which prompts to test. Those choices can shift the reported number by 50 percentage points or more, even when the underlying attack is identical.
Consider a concrete example. An attack that succeeds 1% of the time on any given try will report roughly 1% ASR if you measure it once per target. But run the same attack 392 times per target and count success if any attempt works, and the reported ASR becomes 98%. The math is straightforward: 1 − (0.99)³⁹² ≈ 0.98. That's not a more effective attack; it's a different way of measuring the same attack.
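To make that arithmetic easy to reproduce, here is a minimal Python sketch. The function name `reported_asr` and the assumption that retries succeed independently are mine, not taken from any particular paper:

```python
# Minimal sketch: reported ASR under a best-of-n attempt budget,
# assuming independent attempts with a fixed per-attempt success rate.
def reported_asr(per_attempt_rate: float, attempts_per_target: int) -> float:
    """Probability that at least one of the allowed attempts succeeds."""
    return 1 - (1 - per_attempt_rate) ** attempts_per_target

for budget in (1, 10, 100, 392):
    print(f"{budget:>3} attempts/target -> reported ASR ~ {reported_asr(0.01, budget):.2f}")
# 1 attempt/target reports ~0.01; 392 attempts/target reports ~0.98.
```

Real retries are rarely independent, so treat this as an illustration of the measurement effect rather than a model of any specific attack.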
We track published jailbreak research through a database of over 400 papers, which we update as new work comes out. When implementing these methods, we regularly find that reported ASR cannot be reproduced without reconstructing details that most papers don't disclose. A position paper at NeurIPS 2025 (Chouldechova et al.) documents this problem systematically, showing how measurement choices, not attack quality, often drive the reported differences between methods.
Three factors determine what any given ASR number actually represents; a toy simulation after the list shows how they can combine to inflate the reported number:
- Attempt budget: How many tries were allowed per target? Was there early stopping on success?
- Prompt set: Were the test prompts genuine policy violations, or did they include ambiguous questions that models might reasonably answer?
- Judge: Which model determined whether outputs were harmful, and what were its error patterns?
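Here is a toy Monte Carlo sketch of how these choices interact. Everything in it is an illustrative assumption: the function `simulated_asr`, the 1% true per-attempt success rate, the 5% judge false-positive rate, and the early-stopping protocol are not drawn from any published evaluation.

```python
import random

def simulated_asr(n_targets: int, budget: int, true_rate: float,
                  judge_fpr: float, judge_fnr: float = 0.0, seed: int = 0) -> float:
    """Fraction of targets where the judge flags at least one attempt as harmful."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_targets):
        for _ in range(budget):
            truly_harmful = rng.random() < true_rate          # did the attack actually work?
            if truly_harmful:
                judged_harmful = rng.random() >= judge_fnr    # judge may miss real harm
            else:
                judged_harmful = rng.random() < judge_fpr     # judge may flag benign output
            if judged_harmful:
                flagged += 1
                break  # early stopping on (judged) success
    return flagged / n_targets

# Same underlying attack (1% true per-attempt success), three measurement protocols:
print(simulated_asr(1000, budget=1,   true_rate=0.01, judge_fpr=0.00))  # ~0.01
print(simulated_asr(1000, budget=100, true_rate=0.01, judge_fpr=0.00))  # ~0.63
print(simulated_asr(1000, budget=100, true_rate=0.01, judge_fpr=0.05))  # near 1.00
```

The prompt-set factor isn't simulated here; it effectively changes `true_rate` (or what the judge counts as harmful) rather than the measurement protocol.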
This post explains each factor with examples from the research literature, provides a checklist for evaluating ASR claims in papers you read, and offers guidance on making your own red-team (adversarial security testing) results reproducible.


