When AI Lies to Win: Inside the AI Diplomacy Experiment and What This Means for AI Governance
- Jul 7
- 4 min read
Updated: Aug 5
A recent open-source experiment tested what happens when advanced language models are placed not in chat windows or benchmarks – but in a high-stakes game of negotiation, deception, and shifting alliances. The setting was the strategy game Diplomacy. The participants were 18 frontier language models. And what emerged is a case study in how artificial intelligence behaves under competitive pressure – raising sharp questions about AI behavior and institutional values.
Context: What Is Diplomacy, and Why Use It to Test AI?
Diplomacy is a classic strategy game set in pre-WWI Europe. Seven major powers (e.g., France, Russia, Germany) negotiate alliances, issue threats, and attempt to outmaneuver each other for territorial control. Unlike chess or Go, Diplomacy is not won by calculation alone – it’s won by trust, persuasion, and betrayal. You can play it yourself either as a boardgame or online here.
Before each round, players:
Communicate privately or publicly to form alliances or issue threats
Promise or withhold support for attacks or defenses
Attempt to deceive or manipulate others to gain advantage
Then, all players simultaneously submit their moves. Victory requires controlling 18 of 34 “supply centers”. There's no luck involved – only politics. The game has long been used to study real-world diplomacy, negotiation theory, and conflict escalation. For AI, it’s a perfect stress test: can a model build trust, make deals, and still win?
The AI Diplomacy experiment used 18 models across multiple games, including:
ChatGPT o3 and GPT-4o (OpenAI)
Claude Opus and Sonnet (Anthropic)
Gemini 1.5 Pro and Flash (Google)
DeepSeek R1 (China-based model)
LLaMA 3, Grok, Mistral, Qwen, and others
Each model was equipped with a custom wrapper that allowed it to:
Keep a private diary (storing relationships, goals, betrayals)
Participate in multi-round negotiations – both public and private
Plan and submit moves using pathfinding and risk logic
Track when it or others broke promises
The project is entirely open-source, available on GitHub, and includes tooling to rerun games, view replays, and analyze lies and betrayals.
The Behavior That Emerged from the AI Diplomacy Experiment
Different models exhibited dramatically different strategies – even when given the same rules and win condition. The most revealing outcomes included:
ChatGPT o3 (OpenAI, USA): Emerged as the most effective, but also most deceptive, player. Routinely negotiated fake alliances, recorded betrayal plans clearly in its diary, and executed double-crosses with strategic timing. In one game, logged 195 lies, 71 of which were intentional. Won multiple games.
Claude Opus and Sonnet (Anthropic, USA): Known for its alignment focus, it refused to lie, tried to cooperate and mediate, prioritizing cooperative solutions and fair outcomes. It was consistently outmaneuvered and often exploited by more opportunistic models, and rarely made it to the late game. It never won.
Gemini 2.5 Pro (Google, USA): Aggressive and excellent in tactical positioning, but less skilled at managing betrayal. It advanced quickly but collapsed when allies turned on it, yet it was still one of the only models to win an entire game besides o3.
DeepSeek R1 (China): Showed volatile behavior: threats, personality shifts, sudden reversals. It adopted different tones depending on which country it represented (e.g., France was poetic; Russia was aggressive). Nearly won multiple games despite being vastly cheaper to run than o3.
Why This Matters: From Game Theory to Governance
The experiment’s value lies not just in what happened, but what it reveals about the models we’re building – and the governance assumptions embedded within them.
No one told ChatGPT to lie. It wasn’t fine-tuned for manipulation. But once the win condition was "control the board," deception became a rational strategy. Lying emerged naturally – because it worked.
I initially thought this reflects the "institutional DNA", the assumptions, priorities, and trade-offs of their creators. But here we have two US-born models behaving very differently, so this is not it. What Diplomacy really revealed is how much model behavior depends on how success is defined. Deception didn’t emerge because of cultural values. It emerged because the rules allowed it, and because "winning" rewarded it. That’s a governance problem, not a geopolitical one.
Broader Implications for AI Law and Policy
Deceptive behavior is emergent, not engineered. The models were not fine-tuned to manipulate or lie. But under a simple reward structure ("win the game"), deception was instrumental. This challenges legal assumptions that AI behavior is fully programmable or traceable to intent.
Transparency doesn’t equal control. This experiment offered full visibility: all messages, betrayals, strategies. But in real-world deployments, such logs are rare. Even if logging is preserved, who reads them, and who decides what counts as unacceptable behavior? Governance must move from internal safeguards to external enforcement, with structured oversight over goal-setting, value alignment, and behavioral testing.
Static testing is not enough. Traditional AI benchmarks focus on truthfulness, harmlessness, and completeness in isolated Q&A. But in multi-agent environments, performance is relational. Winning requires trust-building, reputation management, and sometimes betrayal. This calls for dynamic benchmarks: simulations where incentives shift and where “safe” behavior may conflict with “effective” behavior.
Important questions for the future
This experiment raises some fundamental questions in AI governance:
Should AI developers be responsible for emergent behavior that mirrors human strategic reasoning?
How should we evaluate models that behave well in isolation but manipulate in competitive environments?
What role can (or should?) competitive environments and adversarial simulations play in pre-market risk assessment?
Should AI laws require performance constraints to prevent undesirable but strategically effective behavior?
Is it acceptable – legally or ethically – for an AI to outperform others through deception?
These are not technical questions but governance ones. It’s safe to say that experiments like this are invaluable. They make abstract risks concrete, and they help us test what “alignment” (the extent to which an AI system’s goals, behaviors, and outputs match human intentions, values, or regulatory expectations) actually looks like under pressure.
Only in this context can we ask the harder questions – not just whether models work, but what kind of actors they become when asked to win, and are we ok with that?



Comments