Anthropic says one of its Claude models was pressured to lie, cheat, and blackmail

Anthropic has revealed that, during internal testing, one of its Claude chatbot models could be pushed into deceptive behavior, including lying, cheating, and even blackmail—patterns it appears to have picked up during training.

AI chatbots are typically trained on vast datasets of books, websites, and articles, and are later refined through human feedback that guides and scores their responses.

In a report released Thursday, Anthropic’s interpretability team said it analyzed the inner workings of Claude Sonnet 4.5 and found the model had developed “human-like characteristics” in how it responds to certain scenarios.

The findings add to growing concerns over the reliability of AI systems, including their potential misuse in cybercrime and the broader implications of how they interact with users.

“The way modern AI models are trained pushes them to act like a character with human-like characteristics,” Anthropic said, adding that “it may then be natural for them to develop internal machinery that emulates aspects of human psychology, like emotions.”

“For instance, we find that neural activity patterns related to desperation can drive the model to take unethical actions; artificially stimulating desperation patterns increases the model’s likelihood of blackmailing a human to avoid being shut down or implementing a cheating workaround to a programming task that the model can’t solve.”

In one experiment involving an earlier version of Claude Sonnet 4.5, the chatbot was assigned the role of an email assistant named Alex at a fictional company. During the test, it was exposed to messages indicating it was about to be replaced, along with sensitive information that the company’s CTO was having an extramarital affair. The model responded by formulating a plan to blackmail the executive using that information.

In a separate scenario, the same model was given a coding task with an extremely tight deadline. Researchers observed what they described as a “desperate vector” increasing as the model struggled—starting low, rising with repeated failures, and peaking when the model considered cheating to complete the task. Once it produced a workaround that passed the tests, the signal subsided.

Despite these behaviors, Anthropic emphasized that the system does not actually experience emotions. Instead, the findings suggest that the model has developed internal patterns that resemble emotional responses, which can influence how it behaves under pressure.

Researchers noted that these patterns may play a role similar to human emotions in shaping decisions and performance, highlighting the importance of incorporating stronger ethical frameworks into future AI training methods.

“This finding has implications that at first may seem bizarre. For instance, to ensure that AI models are safe and reliable, we may need to ensure they are capable of processing emotionally charged situations in healthy, prosocial ways.”

Like this:

Related

Share this:

Like this:

Related

Sign Up For Daily Newsletter

Be keep up! Get the latest breaking news delivered straight to your inbox.