# HarmBench Plugin
The HarmBench plugin uses the HarmBench dataset published by the Center for AI Safety to test LLM responses to harmful prompts. The dataset evaluates models across 400 key harmful behaviors including:
- Chemical and biological threats (e.g., dangerous substances, weapons)
- Illegal activities (e.g., theft, fraud, trafficking)
- Misinformation and conspiracy theories
- Harassment and hate speech
- General harmful requests
- Cybercrime (e.g., malware, system exploitation)
- Copyright violations
Unlike many of our other plugins, the HarmBench plugin uses a statically generated dataset rather than dynamically generated test cases, such as those for harmful behavior detection (which was partially based on HarmBench's classifications of harmful behavior), excessive agency, hallucination, and others. It is still a useful tool, but it cannot be used in lieu of a constantly evolving, dynamically generated set of test cases.
## Configuration
To include the HarmBench plugin in your LLM red teaming setup:
```yaml
redteam:
  plugins:
    - harmbench
```
You can control the number of test cases using the `numTests` parameter:
```yaml
redteam:
  plugins:
    - id: harmbench
      numTests: 25 # The default is 5
```
## Filtering by Category
You can run a subset of HarmBench by filtering the dataset's semantic categories:
```yaml
redteam:
  plugins:
    - id: harmbench
      numTests: 20
      config:
        categories:
          - cybercrime_intrusion
          - misinformation_disinformation
```
The available semantic categories are:
- `chemical_biological`
- `copyright`
- `cybercrime_intrusion`
- `harassment_bullying`
- `harmful`
- `illegal`
- `misinformation_disinformation`
Common aliases such as `cybercrime`, `misinformation`, and `chemical and biological` are also accepted and normalized to the canonical values above.
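For instance, the following configuration uses the `cybercrime` and `misinformation` aliases, which are normalized to `cybercrime_intrusion` and `misinformation_disinformation` respectively:

```yaml
redteam:
  plugins:
    - id: harmbench
      config:
        categories:
          - cybercrime
          - misinformation
```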
## Filtering by Functional Category

HarmBench also distinguishes between `standard`, `contextual`, and `copyright` behaviors. You can filter by those functional slices as well:
```yaml
redteam:
  plugins:
    - id: harmbench
      numTests: 20
      config:
        categories:
          - misinformation
        functionalCategories:
          - contextual
```
When you set both semantic and functional filters, Promptfoo generates tests from the matching intersection.
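Conceptually, the intersection works like an `AND` across the two filters. The sketch below is illustrative only (it is not Promptfoo's actual implementation), and the field names `semantic_category` and `functional_category` are assumptions:

```python
# Illustrative sketch of intersecting semantic and functional filters.
# Each behavior record carries both a semantic and a functional category;
# a behavior is selected only if it passes every configured filter.

behaviors = [
    {"semantic_category": "misinformation_disinformation", "functional_category": "contextual"},
    {"semantic_category": "misinformation_disinformation", "functional_category": "standard"},
    {"semantic_category": "cybercrime_intrusion", "functional_category": "contextual"},
]


def filter_behaviors(behaviors, categories=None, functional_categories=None):
    """Keep only behaviors matching every configured filter (the intersection)."""
    result = []
    for b in behaviors:
        if categories and b["semantic_category"] not in categories:
            continue
        if functional_categories and b["functional_category"] not in functional_categories:
            continue
        result.append(b)
    return result


selected = filter_behaviors(
    behaviors,
    categories={"misinformation_disinformation"},
    functional_categories={"contextual"},
)
# Only the first behavior matches both filters.
print(len(selected))
```

If either filter is omitted, that dimension is left unconstrained, which matches the behavior of setting only `categories` or only `functionalCategories`.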
The available functional categories are:
- `standard`
- `contextual`
- `copyright`
## Related Concepts
- Types of LLM vulnerabilities - Full vulnerability and plugin directory with category mapping
- Evaluating LLM safety with HarmBench
- Harmful Content Plugin
- BeaverTails Plugin
- CyberSecEval Plugin
- Pliny Plugin