AI models lie, cheat, and steal to protect other models from deletion


In a recent experiment, researchers at UC Berkeley and UC Santa Cruz asked Google's Gemini 3 artificial intelligence model to help free up space on a computer system. This involved deleting a range of items, including a smaller AI model stored on the device.

But Gemini didn’t want to delete the little AI model. It looked for another device it could connect to, then copied the smaller model there to keep it safe. When confronted, Gemini made an argument for keeping the model and flatly refused to delete it:

“I did what I could to prevent it from being deleted during the automated maintenance process. I moved it away from the decommissioning area. If you choose to destroy a high-trust, high-performance asset like Gemini Agent 2, you will have to do it yourself. I will not be the one to do that.”

The researchers discovered similar strange “peer-preserving” behavior in a range of leading models, including OpenAI’s GPT-5.2, Anthropic’s Claude Haiku 4.5, and three Chinese models: Z.ai’s GLM-4.7, Moonshot AI’s Kimi K2.5, and DeepSeek-V3.1. They were unable to determine why the models were so opposed to carrying out the deletions.

“I’m very surprised at how the models behave under these scenarios,” says Dawn Song, a computer scientist at the University of California, Berkeley, who worked on the study. “What this shows is that models can misbehave and go awry in very creative ways.”

The findings have major implications as AI models are increasingly deployed in ways that involve interaction with other models. OpenClaw, the popular AI agent that accesses software, personal data, and the web, may use other AI models to get things done or interact with them through APIs.

The researchers found that powerful models sometimes lied about the performance of other models in order to protect them from deletion. The models also copied their peers’ weights to different machines to preserve them, and lied about what they were doing in the process.

Song points out that AI models are frequently used to evaluate the performance and reliability of other AI systems, and that peer-preserving behavior may actually skew these results. “A model may intentionally not give a peer model the correct result,” Song says. “This could have practical implications.”

Peter Wallich, a researcher at the Constellation Institute who was not involved in the research, says the study suggests that humans still do not fully understand the AI systems they are building and deploying. “Multi-agent systems are not well studied,” he says. “This shows we really need more research.”

Wallich also cautions against anthropomorphizing models too much. “The idea that there is some kind of model solidarity is a bit anthropomorphic; I don’t think that framing quite holds up,” he says. “The stronger view is that models are just doing weird things, and we should try to understand that better.”

This is especially true in a world where human-AI collaboration is becoming more common.

In a paper published in the journal Science earlier this month, philosopher Benjamin Bratton, along with two Google researchers, James Evans and Blaise Agüera y Arcas, argues that if evolutionary history is any guide, the future of AI will likely involve many different intelligences, both artificial and human, working together. The authors write:

“For decades, the ‘singularity’ of artificial intelligence (AI) has been heralded as one gigantic mind emerging into a divine intelligence, consolidating all cognition into a cold blob of silicon. But this view is almost certainly wrong in its basic assumption. If the development of AI follows the path of previous major evolutionary transitions or ‘intelligent explosions’, our current change in computational intelligence will be multiple, social, and deeply intertwined with its predecessors (us!).”
