Anthropic CEO says jailbreaking AI systems could become a matter of “life and death”


Currently, Anthropic seems like the most relevant OpenAI competitor. The startup just released a new chatbot, Claude 2, which is on the same level as ChatGPT, but more cautious.

“I would certainly rather Claude be boring than that Claude be dangerous,” says Dario Amodei of Claude’s safety restrictions. Amodei was formerly the team lead for AI safety at OpenAI and is now CEO of Anthropic. In the future, a fully capable yet safe chatbot is possible, but it’s still an “evolving science,” Amodei says.

Amodei is concerned about so-called jailbreaks, specific prompts that cause a model to generate content that it is not supposed to generate according to the developer’s specifications – or according to the law. These exploits may currently lead to trivial results, but that could change.

“But if I look at where the scaling curves are going, I’m actually deeply concerned that in two or three years, we’ll get to the point where the models can, I don’t know, do very dangerous things with science, engineering, biology, and then a jailbreak could be life or death,” Amodei says.


“I think we’re getting better over time at addressing the jailbreaks. But I also think the models are getting more powerful.”

The Anthropic CEO sees “maybe a 10 percent chance” that scaling AI systems will fail because there isn’t enough data and synthetic data turns out to be inaccurate. “That would freeze the capabilities at the current level.”

If this scaling trend isn’t stopped, Amodei expects to see instances of serious AI misuse, such as the mass generation of fake news, in the next two to three years.

AI Safety: Is machine feedback better than human feedback?

Unlike OpenAI and other AI companies, Anthropic relies on fixed rules and AI evaluation rather than human feedback. The AI system is given a set of ethical and moral guidelines, a “constitution,” that Anthropic has compiled from various sources, such as laws or corporate policies. A second AI system evaluates whether the first system’s generations are following the rules and provides feedback.

Internal testing showed that the safety of this approach was similar in some areas to that of ChatGPT, which was trained with reinforcement learning from human feedback (RLHF), and “substantially stronger” in others, Amodei said. Overall, Claude’s guardrails are stronger, according to Amodei.


Listen to the New York Times “Hard Fork” podcast. Anthropic’s Claude 2 chatbot is currently being rolled out in the US and UK.
