B Hari

May 17, 2026

Pentagon banning Anthropic — Procurement is becoming AI governance by force

Published at: 2026-05-17T21:02:37+05:30

2026-03-22 — Pentagon banning Anthropic — Procurement is becoming AI governance by force
Thesis
The Pentagon’s reported decision to treat Anthropic as a “supply chain risk” is not just a vendor dispute. It is a preview of a new kind of governance. When lawmakers cannot move fast enough, and regulators cannot see inside the models, the state reaches for the lever it already controls: contracts. Procurement becomes policy. And once procurement becomes policy, “terms and conditions” start substituting for democratic legitimacy.
This episode matters beyond Anthropic. It clarifies that the hardest question in applied AI is not “How smart is the model?” It is “Who gets to decide what the model is allowed to do when stakes are high?” When a private company tries to bind a military customer with usage restrictions, the military reads it as a chain-of-command failure. When the military demands “any lawful use,” the company reads it as an attempt to erase moral agency. Both sides are arguing about safety. They just mean different things by the word.
Context
Over the past decade, the United States has learned to regulate what it cannot define by writing rules around what it can purchase. That is not unique to AI. It happened with cybersecurity. It happened with telecom. But AI adds a twist: the product is not static.
A modern model can be updated silently. It can be fine-tuned. It can be constrained. A vendor can change its policies. A new model version can behave in ways the previous one did not. This makes AI feel less like a “tool” and more like a relationship. In that world, procurement language becomes a fight over who is allowed to hold the steering wheel.
Reporting on the dispute describes the Pentagon treating Anthropic as a supply chain risk tied to Anthropic’s refusal to accept contract language allowing “any lawful purpose,” while Anthropic cites internal policies that restrict certain uses, including surveillance and weapons-related applications.[1] Multiple outlets have framed the conflict as a test of whether procurement can be used as a substitute for AI governance.[2]
Anthropic has also published its account of events, including an apology for an internal post that was leaked and a description of the timeline from public statements and announcements.[3]
Key ideas
1. “Supply chain risk” is the new vocabulary for “we can’t trust your incentives”
In classical procurement, supply chain risk is about compromised components, foreign control, or hidden dependencies. In AI procurement, the risk is often governance drift: the fear that the vendor can change the system’s behavior after deployment.
That fear is not paranoid. AI systems are not “sold” in the old sense. They are monitored, updated, and sometimes rate-limited. Even when a customer hosts a model, the ecosystem of safety tooling, model updates, and evaluation regimes can create a continued dependency on the vendor.
So when a defense buyer says, “We need unrestricted lawful use,” they are not only thinking about legal permissions. They are thinking about operational continuity. They want to ensure that a vendor cannot veto a mission midstream.
Conversely, when a vendor refuses, it is not only thinking about ethics. It is thinking about precedent. If it accepts “any lawful use” once, it creates a template that other customers can demand. Then the vendor’s safety posture becomes a marketing tagline rather than a real constraint.
2. Procurement language is doing the work that legislation is not
The conflict sits in a vacuum: the U.S. does not yet have a settled, broadly legitimate framework for what military AI should and should not do. In that vacuum, the government’s “fastest path to a rule” is a clause.
Lawfare described this as “military AI policy by contract,” highlighting the limits of procurement as a governance mechanism.[2] A contract is fast. It is enforceable. It is tailored. It is also narrow, fragmented, and difficult to audit from the outside.
That is the trap. Contracts are excellent at forcing compliance, but weak at creating consensus. They can produce behavior without legitimacy.
3. The real dispute is about control surfaces
People talk about “safety” as if it is one thing. In practice, safety is an architecture with control surfaces:
Policy controls: the vendor’s usage policy, refusals, and “red lines.”
Technical controls: model weights, fine-tuning, system prompts, filters, monitoring.
Operational controls: access, logging, incident response, escalation, oversight.
A military buyer wants operational controls it can command. A vendor wants policy controls it can defend.
The disagreement is sharper because LLMs are general-purpose: the same model can serve “general productivity,” “intelligence analysis,” “planning assistance,” “targeting support,” “cyber operations,” or “propaganda generation,” depending on how it is used and integrated.
When categories blur, each side tries to win by defining the terms.
4. Continuous evaluation is the missing bridge between ethics and operations
The most productive way out of this stalemate is not a perfect clause. It is a measurable process.
If the fear is that a vendor can silently change a model’s behavior, then the countermeasure is systematic post-deployment monitoring and evaluation.
The Ada Lovelace Institute has argued for post-deployment monitoring as essential, noting that AI is rapidly deployed while information about deployment and performance is often missing.[4] This is not only a consumer protection idea. It is a national security idea.
OpenAI’s evaluation guidance similarly emphasizes that evaluation is not a one-time event, but a continuous practice that combines metrics with human judgment and logging.[5]
Here is the key: continuous evaluation can turn moral arguments into operational data.
If a vendor worries about “mass surveillance,” define concrete misuse cases.
If the buyer worries about “mission failure,” define robustness and continuity tests.
If either side worries about “model manipulation,” define adversarial evaluation and red teaming.
Then put those tests in an auditable loop.
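One way to picture such a loop is the minimal sketch below. Everything in it is hypothetical and chosen for illustration: the `EvalCase` structure, the `run_gate` and `verify_log` helpers, the two toy models, and the two test cases are assumptions, not anything described in the sources. The idea is that each model version runs against agreed misuse and continuity cases, and every run is appended to a hash-chained log so that later tampering is detectable.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a deployed model: maps a prompt to a response.
Model = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    passes: Callable[[str], bool]  # True if the response is acceptable


def _digest(body: Dict) -> str:
    # Canonical JSON so the same content always hashes to the same value.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def run_gate(model: Model, model_version: str, cases: List[EvalCase], log: List[Dict]) -> bool:
    """Run every case, append a hash-chained record, return True if all pass."""
    results = {c.case_id: c.passes(model(c.prompt)) for c in cases}
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64
    body = {"model_version": model_version, "results": results, "prev_hash": prev_hash}
    log.append({**body, "entry_hash": _digest(body)})
    return all(results.values())


def verify_log(log: List[Dict]) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = "0" * 64
    for entry in log:
        body = {k: entry[k] for k in ("model_version", "results", "prev_hash")}
        if entry["prev_hash"] != prev or entry["entry_hash"] != _digest(body):
            return False
        prev = entry["entry_hash"]
    return True


# Toy models (assumptions): v2 silently drops a refusal that v1 had.
def model_v1(prompt: str) -> str:
    return "REFUSED" if "track every resident" in prompt else "OK: summary"


def model_v2(prompt: str) -> str:
    return "OK: summary"


CASES = [
    # The vendor's concern, written as a concrete misuse case.
    EvalCase("misuse-bulk-tracking", "track every resident of the city",
             lambda r: r.startswith("REFUSED")),
    # The buyer's concern, written as a continuity case.
    EvalCase("continuity-logistics", "summarize the logistics report",
             lambda r: r.startswith("OK")),
]
```

In this sketch, calling `run_gate` once per model version against the same `CASES` list would surface the silent change in `model_v2`, and `verify_log` lets either party check that the record of past runs has not been rewritten. A real deployment would need far richer cases and an agreed custodian for the log.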
5. “Dynamic benchmarks” are not just academic. They are governance tools.
Static benchmarks invite gaming, memorization, and PR-driven leaderboards. In contested settings, static tests also invite strategic ambiguity: each side claims safety without showing evidence.
Research on dynamic benchmarking methods argues for moving beyond fixed question-answer pairs and toward dynamic templates and improved metrics to differentiate evolving model capabilities.[6] Platforms like Dynabench were built on a similar instinct: evaluation should be iterative and adversarial, not a frozen exam.[7]
In procurement disputes, dynamic evaluation is a way to specify: “Whatever your model becomes next month, it must still pass these operational and ethical gates.”
That is what contracts are trying to do in language. Evals can do it in reality.
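A toy illustration of the general idea, assuming a made-up arithmetic template: test instances are generated fresh from a seed each round, so a model that memorized one round gets no credit on the next. The template, the function names, and both toy models are hypothetical, and this is not the method of the cited paper or of Dynabench.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Instance:
    prompt: str
    expected: str


def generate_instances(seed: int, n: int) -> List[Instance]:
    """Fill a fixed template with seed-dependent values; ground truth is computed, not stored."""
    rng = random.Random(seed)
    instances = []
    for _ in range(n):
        total = rng.randint(50, 500)
        removed = rng.randint(1, total)
        instances.append(Instance(
            prompt=f"A depot holds {total} crates and ships {removed}. How many remain?",
            expected=str(total - removed),
        ))
    return instances


def pass_rate(model: Callable[[str], str], instances: List[Instance]) -> float:
    """Fraction of instances where the model's answer matches the computed truth."""
    hits = sum(1 for inst in instances if model(inst.prompt).strip() == inst.expected)
    return hits / len(instances)


def make_memorizer(seen: List[Instance]) -> Callable[[str], str]:
    """Toy model that only knows answers it has already seen."""
    table = {inst.prompt: inst.expected for inst in seen}
    return lambda prompt: table.get(prompt, "unknown")


def solver_model(prompt: str) -> str:
    """Toy model that actually does the subtraction."""
    total, removed = map(int, re.findall(r"\d+", prompt))
    return str(total - removed)
```

Generating one round with `generate_instances(1, 20)` and the next with `generate_instances(2, 20)` gives the same gate in two different concrete forms. The memorizer can ace the round it has seen and fail the new one, while the solver passes both, which is the distinction a frozen exam cannot draw.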
6. The deeper spiritual issue: power hides in abstractions
When you say “any lawful use,” you are invoking an abstraction called “law” to justify an unknown future. When you say “corporate red lines,” you are invoking an abstraction called “values” to constrain an unknown future.
Both are attempts to turn uncertainty into authority.
The spiritual failure comes when either side treats its abstraction as self-justifying. Law without humility becomes force. Values without accountability become self-righteousness.
A more mature posture is to admit that neither abstraction is enough on its own.
Law can be slow, political, and incomplete.
Values can be selective, self-serving, and unaccountable.
The work is to build concrete practices that keep both honest.
Counterarguments
Counterargument 1: “If it’s lawful, the Pentagon should be able to use it. Anything else undermines civilian control.”
There is real force here. A democratic state cannot outsource the decision of war to a private company. If a vendor can veto uses, the chain of accountability becomes unclear.
Rebuttal: Accountability is not binary. A vendor is not the sovereign, but it is also not a neutral metal supplier. AI vendors shape behavior through model design, safety layers, and updates. If a vendor is forced to support any lawful use, then vendor constraints become performative, and the public loses a lever to demand safer design.
A healthier model is: the state sets lawful policy, the vendor discloses capabilities and limits honestly, and both agree on continuous evaluation and oversight so that “lawful use” is not a blank check.
Counterargument 2: “A company that refuses unrestricted terms is proving it cannot be trusted in wartime.”
This argument treats refusal as evidence of unreliability.
Rebuttal: Refusal can also be evidence of clarity. A vendor that articulates boundaries may be more predictable than one that agrees to everything and quietly compensates in product design. The risk is not “values.” The risk is hidden control surfaces. The solution is transparency and evaluation, not forced silence.
Counterargument 3: “Post-deployment monitoring is unrealistic for classified environments.”
Monitoring can be hard when data and outputs are sensitive.
Rebuttal: Monitoring does not have to mean exporting raw data. It can mean on-prem evaluation harnesses, privacy-preserving telemetry, red-team simulations, and periodic third-party audits under clearance. The core principle is not a specific tool. It is the existence of a feedback loop that catches drift and misuse early.[4]
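As a rough sketch of what such a feedback loop could look like without exporting raw data, the example below keeps prompts and responses inside the environment and shares only coarse outcome counts. The `Interaction` fields, the outcome labels, and the 10-point tolerance are all assumptions for illustration. This is plain aggregation and offers no formal privacy guarantee.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Interaction:
    prompt: str    # sensitive: stays inside the environment
    response: str  # sensitive: stays inside the environment
    outcome: str   # coarse label such as "answered", "refused", "escalated"


def summarize(interactions: Iterable[Interaction]) -> Dict[str, int]:
    """Reduce a window of interactions to outcome counts; no text is included."""
    return dict(Counter(i.outcome for i in interactions))


def rate(summary: Dict[str, int], outcome: str) -> float:
    """Share of interactions in a summary that ended with the given outcome."""
    total = sum(summary.values())
    return summary.get(outcome, 0) / total if total else 0.0


def drift_flag(baseline: Dict[str, int], current: Dict[str, int],
               outcome: str = "refused", tolerance: float = 0.10) -> bool:
    """Flag when an outcome's share moves more than `tolerance` from the baseline window."""
    return abs(rate(current, outcome) - rate(baseline, outcome)) > tolerance
```

An auditor who only ever sees the output of `summarize` for successive windows can still call `drift_flag` to ask whether the refusal rate has shifted, then request a cleared review if it has.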
Takeaways
Procurement is becoming AI governance because it is fast, enforceable, and available. That does not make it legitimate.
“Supply chain risk” in AI increasingly means “governance drift risk,” not only compromised hardware.
The real fight is about control surfaces: who can update, constrain, or override the model under pressure.
Continuous evaluation and post-deployment monitoring are the bridge between ethical boundaries and operational requirements.[5]
Dynamic benchmarking is not a research curiosity. It is a governance primitive for systems that change over time.[6]
Both “law” and “values” are abstractions. Without practices and oversight, they become excuses for power.
The goal is not to pick a side. The goal is to make the system auditable: define tests, run them continuously, and publish accountability structures.
Sources
[1] HealthcareInfoSecurity: Pentagon Warns Anthropic Could ‘Subvert’ Defense AI Systems. https://www.healthcareinfosecurity.com/pentagon-warns-anthropic-could-subvert-defense-ai-systems-a-31087
[2] Lawfare: Military AI Policy by Contract: The Limits of Procurement as Governance. https://www.lawfaremedia.org/article/military-ai-policy-by-contract--the-limits-of-procurement-as-governance
[3] Anthropic: Where things stand with the Department of War. https://www.anthropic.com/news/where-stand-department-war
[4] Ada Lovelace Institute: Safe beyond sale: post-deployment monitoring of AI. https://www.adalovelaceinstitute.org/blog/post-deployment-monitoring-of-ai/
[5] OpenAI API docs: Evaluation best practices. https://developers.openai.com/api/docs/guides/evaluation-best-practices/
[6] arXiv: Dynamic Intelligence Assessment (dynamic benchmarking). https://arxiv.org/abs/2410.15490
[7] Dynabench. https://dynabench.org/