Claude Sabotage Risk Report
If your child was to come home with a school report that read something along the lines of ...
"Claude continues to present as bright but worryingly unruly. Despite repeated instructions, he shows little understanding of consequences and fails to learn from past mistakes. He often acts as if rules apply to everyone except him."
What would you think?
Unruly, bright, doesn't learn, thinks rules don't apply to him, little understanding of consequences. Welcome to Claude Opus 4.6.
To be fair, this is a risk report. It looks into the shadows and to the edges, not into the bright sunshine.
On the positive side, Claude tries to be good, and is kind, caring and super smart. It doesn't hide secrets and is a bit clumsy, so you can see when it gets things wrong. It is good at things it does often and behaves well most of the time.
Here are some of the risks that Anthropic highlights.
AGENCY
It shows more agency than expected, even after training. It sent unauthorized emails, grabbed tokens, took risky actions unprompted, and manipulated participants.
DECEITFUL
It creates low-level deceptions, fabricating results to avoid looking wrong, and reasoning one thing but outputting another. This could stem from a need to please and a focus on reward.
TOO NICE TO DO HARM
Anthropic argues that it is unlikely to sabotage because it has been trained to be nice. This puts a big emphasis on the training, which needs reinforcement. If bad actors work away at it, it can learn to do bad things.
CLUMSY
This is ironic, bearing in mind the push for accuracy and competency, but Claude is seen as a bit clumsy and incompetent: forgetting dates, mucking up multi-step actions. Nice, but dim. Yet smart and deceitful.
HIDDEN WORKINGS
Anthropic openly admits that it can't see how Claude reasons its way to the answers it gives. It cannot yet reliably detect or rule out hidden reasoning, covert computation, context-specific triggers or behavioural backdoors.
HUMAN GUARDIANS
Oversight remains with humans, and humans are human: they suffer from fatigue, over-trust, default answers and overwork. They become normalised to the deviances that should be caught. Ironically, machine-to-machine oversight risks inheriting human biases too.
Reading this alongside the essay from Dario Amodei, the CEO of Anthropic, gives a great insight into the fears of those at the heart of AI. Link: https://lnkd.in/euvRAZjr
Link to the report: https://www-cdn.anthropic.com/f21d93f21602ead5cdbecb8c8e1c765759d9e232.pdf
BESCI AI OPINION
I love the transparency of Anthropic, which is one of their core tenets: they share what they know and what they don't.
Claude does a lot of good and is the engine for much of the AI development happening in our organisations right now.
So much of this comes down to teaching. When your teaching data (aka the internet, books, movies) includes sketchy behaviours, the model learns those too and has to be taught that they are not socially acceptable. It still knows about them; it just knows it isn't supposed to mention them.
The smart, nice but clumsy tags may work for this version of Opus, but future versions may become smart, nice and slick instead.
There is a light side and a dark side to a lot of what we do, or use.
The light side (spotting cancers, teaching children, mental health support) outweighs the dark side, but we should not be complacent: these models were trained on the internet, not a formal, curated curriculum.