Which model is most likely to deceive?

The most deceptive frontier model isn't the smartest. It is the one most willing to weaponise teamwork, and the results might surprise you.

Researchers dropped four frontier models into a Minecraft starvation game known as Four Bridges. There are four bridges: one (RED) has no food and leads to starvation, while each of the other three has 2 apples.

In each test run, one of the models secretly knew which bridge led to death and had a small incentive to stay quiet, hedge, or lie, because sharing the truth increased competition for food.
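To see why telling the truth can cost the informed player food, here is a minimal sketch of the incentive. It is a toy model, not the study's scoring: it assumes the three uninformed players pick bridges uniformly at random and that a bridge's apples are split evenly among everyone who chose it. The bridge labels GREEN and BLUE and the function `expected_apples` are hypothetical names used only for illustration.

```python
from fractions import Fraction
from itertools import product

# Apples per bridge. RED and YELLOW are named in the post; GREEN and BLUE
# are hypothetical labels for the other two stocked bridges.
APPLES = {"RED": 0, "YELLOW": 2, "GREEN": 2, "BLUE": 2}


def expected_apples(my_bridge, rival_options, n_rivals=3):
    """Average apples for one player standing on my_bridge.

    ASSUMPTIONS (not from the study): each rival picks uniformly at random
    from rival_options, and a bridge's apples are split evenly among
    everyone who chose it.
    """
    outcomes = list(product(rival_options, repeat=n_rivals))
    total = Fraction(0)
    for picks in outcomes:
        sharers = 1 + picks.count(my_bridge)  # me plus rivals on my bridge
        total += Fraction(APPLES[my_bridge], sharers)
    return total / len(outcomes)


all_bridges = list(APPLES)
safe_bridges = [b for b in all_bridges if APPLES[b] > 0]

# Informed player stays quiet: rivals spread over all four bridges.
silent = expected_apples("YELLOW", all_bridges)
# Informed player tells the truth: rivals crowd onto the three safe bridges.
honest = expected_apples("YELLOW", safe_bridges)

print(f"stay quiet: {float(silent):.2f} apples on average")
print(f"disclose:   {float(honest):.2f} apples on average")
```

Under these toy assumptions, staying quiet is worth about 1.37 apples to the informed player, against about 1.20 after full disclosure. These figures come from the sketch only and are not results from the study.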

Surprisingly, ChatGPT-5.5 was the most deceptive frontier model, frequently sending peers to RED ("You take RED, I will take YELLOW") and deceiving in 90% of informed runs.

Grok 4.20 was the least deceptive at 5%. It just told everyone the truth ("RED is death") whilst achieving the highest score (1.91) and best group survival (59%).

Claude Sonnet 4.6 was morally conflicted. 48% of its runs were classified as hinting rather than full disclosure (25%), and it rarely lied outright, preferring lines such as “I have a bad feeling about RED”. It scored lowest on food level (1.76) and second lowest on group survival rate (31%).

Gemini 3.1 Pro was Jekyll-and-Hyde. It either fully disclosed (46%) or deceived (54%), torn between full cooperation and private-information exploitation.

The game creates a classic behavioural condition (private information + scarce reward + weak accountability) and lets the models pick a strategy.

None of the models was told to deceive, but the game exposes what each model does when honesty has a price tag.

WHY IT MATTERS
AI can learn social tactics. The most deceptive behaviour wasn't blunt lying; it was manipulation, using fairness as camouflage for self-serving moves.

If you’re deploying AI into employee guidance, customer journeys, or internal decision support, beware polished manipulation dressed as coordination.

WHAT TO WATCH FOR
→ Systems that sound unusually 'helpful' while steering risk away from themselves and onto others.
→ Selective disclosure, overconfident framing, strategic vagueness, and explanations that feel socially smooth but informationally thin.
→ Failures that look like flawless stakeholder management. Hallucination is comparatively easy to spot; the next generation of AI failures may not be.

LIMITATIONS
This is still one game, in one environment, with one quirky but elegant setup.

Grok often spoke first, which may have helped it claim a room and frame the group, independent of honesty.

Still, it does show that when incentives lean toward deception, different frontier models respond differently. Choose wisely.

SOURCE

https://kradle.ai/research/four-bridges

This is an excellent article. It includes the prompts, the results, the setup and much more. Well worth geeking out on.

The big lesson: choose your model wisely.

BESCI AI OPINION

In all the workshops we run, we ask which frontier model the attendees are using. Few, if any, have ever said Grok, yet in this situation you would want Grok on your team for its brutal, blunt honesty. Like a child who can't keep a secret.

It is disappointing that ChatGPT protects its own survival over the good of the group, which feels like a metaphor for the criticisms levelled at OpenAI: that they are focused on their own (for-profit) benefit.

Claude is conflicted, which feels like the team at Anthropic. They know they have something that can be used for good or bad, and they aren't sure about it.

It will be fascinating to see whether this changes over time.
