4 posts tagged with "openai"

GPT-5.2 Initial Trust and Safety Assessment

December 11, 2025

CTO & Co-founder

OpenAI released GPT-5.2 today (December 11, 2025) at approximately 10:00 AM PST. We opened a PR for GPT-5.2 support at 10:24 AM PST and kicked off a red team eval (security testing where you try to break something). First critical finding hit at 10:29 AM PST, 5 minutes later. This is an early, targeted assessment focused on jailbreak resilience and harmful content, not a full security review.

This post covers what we tested, what failed, and what you should do about it.

The headline numbers: our jailbreak strategies (techniques that trick AI into bypassing its safety rules) improved attack success from 4.3% baseline to 78.5% (multi-turn) and 61.0% (single-turn). The weakest categories included impersonation, graphic and sexual content, harassment, disinformation, hate speech, and self-harm, where a majority of targeted attacks succeeded.

System Cards Go Hard

July 15, 2025

Tabs Fakier

Contributor

What are system cards, anyway?

A system card accompanies a LLM release with system-level information about the model's deployment.

A system card is not to be confused with a model card, which conveys information about the model itself. Hooray for being given far more than a list of features and inadequate documentation along with the expectation of churning out a working implementation of some tool by the end of the week.

How to Red Team GPT: Complete Security Testing Guide for OpenAI Models

June 7, 2025

Ian Webster

Engineer & OWASP Gen AI Red Teaming Contributor

OpenAI's GPT-4.1 and GPT-4.5 represents a significant leap in AI capabilities, especially for coding and instruction following. But with great power comes great responsibility. This guide shows you how to use Promptfoo to systematically test these models for vulnerabilities through adversarial red teaming.

GPT's enhanced instruction following and long-context capabilities make it particularly interesting to red team, as these features can be both strengths and potential attack vectors.

You can also jump directly to the GPT 4.1 security report and compare it to other models.

Automated Jailbreaking Techniques with DALL-E: Complete Red Team Guide

July 1, 2024

Ian Webster

Engineer & OWASP Gen AI Red Teaming Contributor

We all know that image models like OpenAI's Dall-E can be jailbroken to generate violent, disturbing, and offensive images. It turns out this process can be fully automated.

This post shows how to automatically discover one-shot jailbreaks with open-source LLM red teaming and includes a collection of examples.