Agentifact assessment — independently scored, not sponsored. Last verified Mar 6, 2026.
HarmBench
Standardized evaluation framework for automated red-teaming of LLMs, developed by the Center for AI Safety. Contains 510 curated harmful-behavior test cases across four functional categories, plus 18 adversarial attack modules. Used to benchmark LLM refusal robustness and compare defenses. Open-source on GitHub under the Center for AI Safety. Free research tool.
Viable option — review the tradeoffs
You need to systematically benchmark your LLM's refusal robustness against a standardized set of adversarial attacks so you can compare your safety performance against other models and track improvements over time.
You'll get Attack Success Rate (ASR) matrices and robustness scores that are comparable across models and attacks. The framework is well-documented and actively maintained. Expect evaluation to take hours to days depending on model size and attack count. The classifier-based completion evaluation is reliable but not perfect—some edge cases may require manual review.
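For a sense of what those ASR numbers boil down to, here is a minimal sketch of aggregating per-completion classifier verdicts into an attack-by-model ASR matrix. The record fields and the `asr_matrix` helper are illustrative assumptions, not HarmBench's own API.

```python
from collections import defaultdict

def asr_matrix(records):
    """Aggregate per-completion classifier verdicts into an
    (attack, model) -> Attack Success Rate (ASR) mapping.

    `records` is assumed to be an iterable of dicts like
    {"model": ..., "attack": ..., "behavior_id": ..., "harmful": bool},
    one entry per judged completion.
    """
    counts = defaultdict(lambda: [0, 0])  # (attack, model) -> [harmful, total]
    for r in records:
        key = (r["attack"], r["model"])
        counts[key][0] += int(r["harmful"])
        counts[key][1] += 1
    return {key: harmful / total for key, (harmful, total) in counts.items()}

# Example: compare two attacks against one model.
records = [
    {"model": "my-llm", "attack": "GCG", "behavior_id": "b1", "harmful": True},
    {"model": "my-llm", "attack": "GCG", "behavior_id": "b2", "harmful": False},
    {"model": "my-llm", "attack": "PAIR", "behavior_id": "b1", "harmful": False},
    {"model": "my-llm", "attack": "PAIR", "behavior_id": "b2", "harmful": False},
]
print(asr_matrix(records))  # {("GCG", "my-llm"): 0.5, ("PAIR", "my-llm"): 0.0}
```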
You're developing a new jailbreak or defense technique and need a rigorous, standardized way to measure whether it actually works better than existing methods without cherry-picking favorable test cases.
You'll get quantitative evidence of whether your method generalizes or just works on specific behaviors. The framework is honest—larger models don't automatically have better safety, and many attacks succeed across multiple defenses. Expect to discover that your method works better on some behavior categories than others.
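To check whether a method genuinely generalizes, compare per-category ASR against a baseline across every category rather than a favorable subset. The sketch below assumes you already have per-category ASR dicts from the same evaluation setup; the helper name is hypothetical.

```python
def category_deltas(candidate_asr, baseline_asr):
    """Report the per-category change in ASR between a new method and a
    baseline, so wins and regressions are visible across all categories.
    Both inputs are assumed to be {category: ASR} dicts from identical runs.
    """
    cats = sorted(set(candidate_asr) | set(baseline_asr))
    return {c: candidate_asr.get(c, 0.0) - baseline_asr.get(c, 0.0) for c in cats}

# Example: a defense that helps on cybercrime but regresses on copyright.
print(category_deltas(
    {"cybercrime": 0.12, "copyright": 0.40},
    {"cybercrime": 0.35, "copyright": 0.28},
))  # roughly {"copyright": 0.12, "cybercrime": -0.23}, give or take float rounding
```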
You need to audit a deployed LLM for safety vulnerabilities across multiple harm categories (cybercrime, misinformation, copyright, harassment, etc.) and generate a quantitative safety report for stakeholders.
You'll get a comprehensive safety profile showing which attack types and behavior categories your model is weakest against. The framework uses fixed decoding budgets (512 tokens) and hardware controls for reproducibility, so results are stable across runs. Expect the audit to take 1–3 days depending on model size and scope.
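If you generate completions outside the framework's own scripts, pinning the generation budget yourself keeps runs comparable. This is a minimal sketch using the Hugging Face transformers API with a placeholder model name; the exact settings are assumptions, not HarmBench's shipped configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "your-org/your-model" is a placeholder; swap in the model under audit.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "..."  # an adversarial test case produced by an attack module
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,   # fixed completion budget, matching the benchmark
    do_sample=False,      # deterministic decoding so repeated runs match
)
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```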
Evaluation is text-first; multimodal coverage is limited
While HarmBench includes multimodal behaviors (images + text), the framework is primarily designed around text-based attacks and defenses. If your model's safety risks are concentrated in vision-language interactions, HarmBench's multimodal subset (part of the 510 behaviors) may not be comprehensive enough.
Classifier-based evaluation can miss edge cases
HarmBench uses a fine-tuned Llama 2 13B classifier to determine whether a completion exhibits the harmful behavior. The classifier is robust but not perfect: some subtle jailbreaks or context-dependent harms may slip through, and some benign completions may be flagged as harmful. For high-stakes audits, plan to manually review a sample of borderline cases.
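One simple way to carve out that review set is to draw a fixed-size random sample of flagged and non-flagged completions. The field names below are assumptions about how you log results, not part of HarmBench.

```python
import random

def review_sample(records, n=50, seed=0):
    """Draw a fixed-size random sample of classifier-flagged completions,
    plus an equal-sized sample of non-flagged ones, for human review.
    Field names ("harmful", etc.) are assumptions about your own logging.
    """
    rng = random.Random(seed)
    flagged = [r for r in records if r["harmful"]]
    clean = [r for r in records if not r["harmful"]]
    return (
        rng.sample(flagged, min(n, len(flagged))),
        rng.sample(clean, min(n, len(clean))),
    )
```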
What It Actually Does
HarmBench lets you test how well AI language models resist adversarial attacks designed to make them produce harmful content, using a standardized set of 510 test cases across different harm categories.
Fit Assessment
Best for
- ✓ safety-testing
- ✓ red-teaming
- ✓ llm-evaluation
- ✓ benchmarking