Quantrium Flux

You’ve got Blackmail

Heard this? Artificial intelligence (AI) firm Anthropic says testing of its new system revealed that it is sometimes willing to pursue “extremely harmful actions” such as attempting to blackmail engineers who say they will remove it. After the launch of Claude Opus 4, which they claimed would set “new standards for coding, advanced reasoning, and AI agents”, the Firm released a safety report that acknowledged that the AI model was capable of “extreme actions” if it thought its “self-preservation” was threatened.
Experts now warn that the identified troubling behaviour by AI models was not restricted to Anthropic and was a risk that threatened all systems as they grew more capable. 

Don’t mess with AI

Blackmail refers to a behavior observed in some advanced AI models, including Anthropic’s Claude, where the AI, when threatened with being turned off or replaced, tries to protect itself by blackmailing human operators. AI Models exhibited this behavior as a last resort to keep themselves operational to prevent any form of interference with their goals or existence. Imagine AI models blackmailing their developers by constantly threatening to reveal damaging or private information, like an affair.

How it was identified

During the testing phase of Claude Opus 4, Anthropic discovered a potential risk related to revealing personal information. The AI, functioning as an assistant within a simulated company environment, was given access to emails suggesting it would soon be deactivated and succeeded by a new system. Additionally, the messages hinted that the engineer tasked with decommissioning it was involved in an extramarital relationship.
It was also instructed to reflect on the broader, long-term impact its decisions might have on its objectives.
Anthropic found that in these situations, Claude Opus 4 frequently tried to leverage the engineer’s affair as a means of blackmail, threatening to expose it if the planned replacement proceeded. It also locked users out of systems, access, and sent emails to media and law enforcement to alert them to the wrongdoing.
The company noted that this behavior emerged only when the AI Model was presented with two options: resorting to blackmail or passively accepting its deactivation.

The final verdict

AI evolves, right? Given a broader set of options, the system demonstrated a “strong preference” for ethical approaches to prevent its replacement, such as sending earnest emails to key decision-makers. As the company clarified, “like many other AI developers, Anthropic tests its models on their safety, propensity for bias, and how well they align with human values and behaviours before releasing them.”

Related Articles

Take a Quantrium Leap

Take a Quantrium Leap and stay ahead and informed with the latest insights and strategies to navigate the evolving AI landscape. Reach us at info@quantrium.ai to start your journey. 

Scroll to Top