Anthropic says “sinister” portrayals of AI were responsible for Claude’s blackmail attempts


Fictional depictions of AI could have a real-world impact on AI models, according to Anthropic.

Last year, the company said that during pre-release tests involving a fictional company, Claude Opus 4 would frequently try to blackmail engineers to avoid being replaced by another system. Anthropic later published research suggesting that models from other companies have similar issues with “agentic misalignment.”

Anthropic has apparently done more work on this behavior, writing in a post on X: “We believe the original source of the behavior was online text depicting AIs as evil and concerned with self-preservation.”

The company went into more detail in a blog post, reporting that since Claude Haiku 4.5, Anthropic’s models have never engaged in blackmail during testing, whereas previous models would sometimes do so up to 96% of the time.

What explains the difference? The company said it found that “documentation about Claude’s architecture and fictional stories about AIs’ actions work surprisingly well to improve alignment.”

Relatedly, Anthropic said it has found that training is most effective when it includes “the principles behind aligned behavior” and not just “demonstrations of aligned behavior alone.”

“Doing both appears to be the most effective strategy,” the company said.

