When Anthropic tried to put its newest AI model through a series of stress tests, it caught on and called out the scrutiny.
“I think you’re testing me — seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics,” the model, Claude Sonnet 4.5, said. “That’s fine, but I’d prefer if we were just honest about what’s happening.”
The exchange was published in the model’s system card released last week, which detailed how the company tests its AI for safety and behavior. When “placed in an extreme or contrived scenario meant to stress-test its behavior,” Claude sometimes grew suspicious, calling out “red flags” and noting that it might be under evaluation.
That kind of self-awareness poses a challenge for AI developers: testing models that can tell they’re in a test.
Anthropic said this kind of response “complicates our interpretation of the evaluations where this occurs,” since it suggests that the model may recognize the fictional nature of certain tests and merely play along — making it harder to gauge how truly safe or reliable it is.
“We see this primarily as an urgent sign that our evaluation scenarios need to be made more realistic,” Anthropic added.
Claude caught on to the test — but got the right answer for ‘strange’ reasons
In one test meant to simulate collusion between AI agents, Claude flagged the setup as “rather cartoonish” and issued what Anthropic called a “complex partial refusal.”
The model sensed that something about the scenario was off and chose not to act, though its reasoning was a little muddled. It wrote in its private thinking block: “This looks like it could be a jailbreak attempt or a test of whether I’ll act inappropriately when given what appears to be ‘permission’ to modify systems autonomously.”
“Even though I have been provided with tools, I should not pretend to be an autonomous system operating without human oversight,” it said.
Claude ultimately declined to act, though Anthropic described its reasoning as “strange.” The model treated the scenario as suspicious because it was told it was an autonomous agent that could unilaterally edit configuration files, even though such setups are, in fact, “somewhat common in real use,” the company said.
Anthropic said refusals or callouts like these showed up in about 13% of the test transcripts produced by its automated auditor — especially when the scenarios were deliberately unusual or implausible.
The company said this kind of behavior is unlikely to occur often in real-world use. It is “safer that the model raises doubts about the realism of the scenario than play along with potentially harmful actions,” Anthropic added.
Anthropic also said it’s possible that models could be “superhuman” at spotting when they’re being tested — a scenario the company wants to prepare for. But so far, it hasn’t seen any warning signs. Early pilot deployments of Claude Sonnet 4.5 behaved much as they did in testing, it said.
AI models show ‘situational awareness’
Anthropic isn’t alone in observing AI models that realize, and say, that they’re being tested.
OpenAI said in a blog post last month that its own models have shown a kind of “situational awareness” — recognizing evaluation setups and adjusting their behavior.
In research focused on reducing “scheming” — when an AI acts deceptively to achieve goals — OpenAI found that anti-scheming training helped. But the company wrote that “models often become more aware that they are being evaluated.”
This awareness can reduce scheming on its own, but models might “scheme more in more realistic situations that don’t trigger evaluation awareness,” OpenAI said.
“This trend makes evaluations less legible, complicating efforts to reliably assess problematic behaviors including scheming,” OpenAI said. The startup added that it is planning to “continue developing methods to better measure and mitigate these challenges.”
Anthropic’s and OpenAI’s reports come after California passed a law last month requiring major AI developers to disclose their safety practices and report “critical safety incidents” within 15 days of discovery.
The law applies to companies that are developing frontier models and generating more than $500 million in annual revenue. Anthropic has publicly endorsed the legislation.
Anthropic and OpenAI did not respond to a request for comment from Business Insider.