Tools

Explore

Videos Channels Figures

Atmrix

Anthropic Cast

02/28/25

@ Anthropic

The classifiers project was evidence that we really care about solving safety problems and finding empirical solutions rather than just doing research that looks good.

Video

Defending against AI jailbreaks

Related Takeaways

Anthropic Cast

02/28/25

@ Anthropic

The classifiers approach aims to improve user experience for non-dangerous tasks while effectively blocking harmful content.

Anthropic Cast

02/28/25

@ Anthropic

We are committed to actually trying to solve the real problems associated with AI safety, including misuse and misalignment risks.

Anthropic Cast

02/28/25

@ Anthropic

The classifiers can detect attempts to circumvent the system, which is crucial for maintaining security.

Anthropic Cast

02/28/25

@ Anthropic

Using a more capable classifier after serving a response can help flag potentially harmful outputs and involve human reviewers for the most dangerous responses.

Ethan Perez

02/28/25

@ Anthropic

The output classifier monitors the model's responses in real time and can block dangerous outputs that violate our safety values.

Anthropic Cast

03/18/25

@ Anthropic

Constitutional classifiers are a promising development in AI monitoring, demonstrating that smaller models can effectively check larger ones.

Sam Altman

04/12/25

@ TED

Our safety track record is strong, with about 10% of the world using our systems regularly, despite differing views on AI safety.

Anthropic Cast

02/28/25

@ Anthropic

With the constitutional classifiers, we achieved thousands of hours of robustness to red teaming, significantly improving our defenses against jailbreaks.

Ethan Perez

02/28/25

@ Anthropic

Classifiers are decoupled from the text generation model, so they can be updated without unintended consequences in other domains. That gives customers a lot of assurance that the model itself is not undergoing major changes.