Anthropic Cast
02/28/25
@ Anthropic
The two main improvements we made were homing in on the constitution idea and solidifying the jailbreak styles that we trained on, which helped reduce false positives.
Video
Defending against AI jailbreaks
@ Anthropic
02/28/25
Related Takeaways
Anthropic Cast
02/28/25
@ Anthropic
With the constitutional classifiers, we achieved thousands of hours of robustness to red teaming, significantly improving our defenses against jailbreaks.
Anthropic Cast
02/28/25
@ Anthropic
We started with a model that had basic training to refuse harmful queries, but many jailbreaks existed that could bypass these safeguards.
Anthropic Cast
02/28/25
@ Anthropic
The overall measure of whether we're making progress is how hard it is to find a universal jailbreak for a system, without increasing refusals or compute costs too much.
Ethan Perez
02/28/25
@ Anthropic
The motivation behind our work on jailbreaks is to ensure future models can be deployed safely as we make progress on safety.
Anthropic Cast
02/28/25
@ Anthropic
Rapid response monitoring for new jailbreaks allows for quick adaptation and retraining of classifiers to improve safety.
Anthropic Cast
06/11/25
@ Anthropic
We've made substantial progress in reducing reward hacking in our models, cutting it down by about 80% in the new iterations.
Ethan Perez
02/28/25
@ Anthropic
We focused on universal jailbreaks because they can empower non-experts to bypass safeguards easily, which is particularly concerning.
Anthropic Cast
02/28/25
@ Anthropic
We need mechanisms to detect and respond to jailbreaks that still bypass our classifiers, including a bug bounty program and incident detection. At the same time, the classifiers must be efficient, small, and support token-by-token streaming, so they add little latency and users still get immediate responses.
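The streaming constraint described above can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: `score_token`, the threshold, and the halt message are hypothetical stand-ins for a real output classifier.

```python
def score_token(prefix: str, token: str) -> float:
    """Stand-in for a small classifier scoring how harmful the output
    so far looks. Here a toy keyword heuristic; a real system would run
    a learned classifier over the streamed prefix."""
    return 1.0 if "forbidden" in (prefix + token).lower() else 0.0

def stream_with_classifier(tokens, threshold=0.5):
    """Yield model tokens one at a time, scoring the running output and
    halting the stream as soon as the score crosses the threshold.
    Safe responses stream through with no added buffering delay."""
    prefix = ""
    for token in tokens:
        if score_token(prefix, token) >= threshold:
            yield "[response halted by safety classifier]"
            return
        prefix += token
        yield token

print(list(stream_with_classifier(["Hello", " world"])))
# → ['Hello', ' world']
```

The key design point is that the classifier runs per token on the growing prefix rather than on the completed response, which is what keeps latency low while still allowing an in-flight response to be cut off.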
Ethan Perez
02/28/25
@ Anthropic
The first layer of our defense against jailbreaks is the input classifier, which analyzes the entire conversation before it reaches the model.
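The input-classifier layer can be sketched like this. It is a hedged illustration only: `classify_conversation`, `guarded_call`, and the threshold are hypothetical names, and the keyword check stands in for a learned classifier.

```python
def classify_conversation(messages: list[str]) -> float:
    """Stand-in for an input classifier that scores the ENTIRE
    conversation, not just the latest message, since jailbreaks can be
    spread across turns. Toy keyword heuristic for illustration."""
    text = " ".join(messages).lower()
    return 1.0 if "jailbreak" in text else 0.0

def guarded_call(messages, model=lambda m: "model response", threshold=0.5):
    """Run the input classifier before the conversation reaches the
    model; only conversations scored below the threshold pass through."""
    if classify_conversation(messages) >= threshold:
        return "Request blocked by input classifier."
    return model(messages)

print(guarded_call(["Hello!"]))
# → model response
```

Scoring the whole conversation, rather than each message in isolation, is what lets this first layer catch multi-turn attacks before the model ever sees them.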