How Anthropic’s AI Safety Framework Misses the Mark

Jack Kelly

Jul 8, 2025

Anthropic has tried to build a reputation for taking AI safety seriously, and its Responsible Scaling Policy has become a central pillar of that identity. But while the company presents the framework as a rigorous safeguard, in practice it falls well short of what is needed to meaningfully protect against the risks posed by increasingly capable AI systems.

Anthropic was the first AI company to release a Frontier AI Safety Policy, known as their Responsible Scaling Policy (RSP). These frameworks, sometimes called “red line” or “if-then” commitments, focus on defining a set of safety and security risk mitigations that will be put into place before deploying increasingly powerful models. Anthropic describes their policy, a detailed 23-page public document, as a “public commitment not to train or deploy models capable of causing catastrophic harm unless we have implemented safety and security measures that will keep risks below acceptable levels.”

They insist that this policy is more than a symbolic gesture. Anthropic describes it as core to the culture and purpose of the company. Co-founder Tom Brown says that “in the same way that the U.S. treats the Constitution as the holy document … the RSP is like the holy document for Anthropic.” Co-founder Dario Amodei stated that the RSP “forces unity because if any part of the org is not in line with our safety values, it shows up through the RSP. The RSP is going to block them from doing what they want to do … it’s not just a bunch of bromides that we repeat, it’s something that if you show up here and you’re not aligned, you actually run into it.”

The biggest problem with Anthropic’s RSP is that its risk thresholds are extraordinarily high. Practically speaking, the current RSP requires no deployment or security safeguards beyond those already in place until Anthropic releases a model with the capability to:

  1. allow effective compute (a proxy for AI progress) to increase 1000x within a single year; or 

  2. allow entry-level PhD biologists to approximate the capabilities of world-class, state-backed bioweapons teams. 

Both of these standards are astronomical. Even models halfway between today’s state of the art and these thresholds would warrant significantly stronger safety assurances than the policy currently requires.

Beyond this central issue, Anthropic’s RSP suffers from a credibility problem: last-minute changes have weakened the policy, and unclear language makes it difficult to understand for employees, outsiders, and the regulators capable of holding Anthropic accountable.

Moving the Goalposts

The commitments in Anthropic’s RSP have changed over time. This isn’t a problem in and of itself; the policy is a living document with a lofty goal: to anticipate future challenges posed by advanced AI and the solutions that best address them. AI is moving extremely rapidly, and it would be concerning if Anthropic weren’t regularly updating the policy to account for how the underlying reality shifts from month to month.

However, Anthropic also made clear that they intend for these changes to strengthen their risk management, not to weaken it:

“Since the frontier of AI is rapidly evolving, we cannot anticipate what safety and security measures will be appropriate for models far beyond the current frontier. We will thus regularly measure the capability of our models and adjust our safeguards accordingly. Further, we will continue to research potential risks and next-generation mitigation techniques. And, at the highest level of generality, we will look for opportunities to improve and strengthen our overarching risk management framework.”

Despite this, recent updates to the RSP weaken its commitments in a way that appears motivated by product deadlines rather than by principled risk-management decisions. On May 14, Anthropic updated the Responsible Scaling Policy to weaken security safeguards intended to reduce the risk of company insiders stealing advanced models. The new safeguards no longer require the company to be robust to some insider threats, yet the policy still claims robustness against cybercriminals and terrorists, even though those actors would plausibly try to gain access through the very insiders it no longer covers (or at least by using their credentials).

Just eight days later, Anthropic activated those now-weakened safeguards for a new model release. It’s as if a car company weakened its emissions commitment a week before announcing a car that exceeds its previous emissions limit.

Anthropic also walked back a previous commitment to define the next set of capability evaluations once their models reached a new capability level. They previously stated that when they developed a substantially more capable model, they would define a new, stronger set of evaluations, presumably to ensure that the risk level hadn’t increased so much that an even stronger set of safeguards was needed to accompany the new model. For example, when Anthropic reached what they previously considered to be ASL-3 thresholds, they would implement evaluations for ASL-4 thresholds. The RSP acknowledges that models might jump by more than one AI safety level in a single release.

But in their current policy, that commitment is gone. When a journalist pointed out this change, Anthropic responded that the current policy nonetheless defines capability thresholds corresponding to ASL-4, the next upcoming set of deployment and security standards. But those capability thresholds fall short of full-fledged evaluations, which don’t appear in the RSP (although a smaller set of vague, qualitative evaluations is mentioned in the Claude 4 system card, which Anthropic says they used to confirm the model doesn’t require ASL-4 protections). Anthropic also hasn’t defined the “warning sign evaluations” that the original policy promised for chemical, biological, radiological, and nuclear threats, although an early checkpoint for AI R&D is mentioned.

Lacking Clarity

Anthropic’s co-founder, Daniela Amodei, described the RSP as increasing clarity “because it's written down what we’re trying to do and it’s legible to everyone in the company, and it’s legible externally what we think we’re supposed to be aiming towards from a safety perspective.” She also stated that “We’re really trying to make it clearer what we mean.” So far, it doesn’t seem the company has succeeded.

As mentioned, Anthropic claims to have already defined “ASL-4 thresholds” (by which they mean capability thresholds that will require ASL-4 security and deployment standards). It takes a close reading of the RSP to notice that, for AI R&D, they’ve actually only defined thresholds that require ASL-4 security measures — nothing is said about what R&D thresholds, if any, would trigger increased deployment safeguards. 

This was also true for ASL-3, even though Anthropic’s original RSP seemed to recognize the need for deployment safeguards around models that accelerate AI R&D. The original description of deployment safeguards included internal usage controls, explicitly covering reinforcement learning training and R&D activities, under which ASL-3 model outputs would be logged and monitored. That policy anticipated that advanced AI R&D and autonomy capabilities would be dangerous not just because they could be stolen, but also because they could be misused by both external and internal actors. While the original policy explicitly required logging and monitoring of internal usage (such as RL training), the current version does not require logging and places less emphasis on internal misuse risks.

Also, when Anthropic says they’ve defined a new set of “ASL-4 thresholds,” you might assume this means thresholds more advanced than those previously defined. For AI R&D, this isn’t the case. Instead, it appears two new thresholds were created, AI R&D 4 and AI R&D 5, but their content is identical to the single previous threshold in Anthropic’s RSP, which was implicitly associated with ASL-3. Specifically, that threshold had two constituent parts, which now make up the fourth and fifth levels, respectively.

Unfortunately, it gets even more confusing: AI R&D 4 and AI R&D 5 don’t correspond to ASL-4 and ASL-5. Instead, they correspond to ASL-3 and ASL-4, respectively (or, more precisely, to the security side of those AI safety levels, with nothing said about the deployment side).

Even Anthropic’s co-founder and head of policy, Jack Clark, appears to think the company has defined ASL-4 and ASL-5 in the latest version of the RSP. In fact, ASL-5 does not appear anywhere in the RSP. In a similar error, the Claude 4 system card describes AI R&D 5 as “ASL-5 autonomy,” even though crossing that threshold only triggers ASL-4 requirements.

Conclusion

Anthropic’s Responsible Scaling Policy is presented as a principled, safety-first approach to advanced AI development, but its execution reveals serious shortcomings. Its risk thresholds are set so high that even models with extremely dangerous capabilities would not be required to carry the additional safeguards they warrant. The company has weakened safeguards just ahead of major model releases, muddied once-clear commitments, and introduced confusing terminology that undermines transparency. If Anthropic aims to lead on AI safety, it must embrace clarity, accountability, and strong precautionary thresholds before the pace of progress outstrips the policy meant to contain its risks.

Sources: Anthropic Responsible Scaling Policy (5/14/25, 3/31/25, 10/15/24, 9/19/23), Claude 4 System Card, AI Lab Watch, Ryan Greenblatt, Seoul Tracker, Jack Clark