
Microsoft researchers on Wednesday released a new simulation environment designed to test artificial intelligence agents, along with new research showing that existing agent models may be vulnerable to manipulation. The research, conducted in collaboration with Arizona State University, raises new questions about how well AI agents will perform when working unsupervised — and how quickly AI companies can deliver on their promises of an agent future.
The simulation environment, which Microsoft calls "Magentic Marketplace," is designed as a synthetic platform for experimenting with AI agent behavior. A typical experiment might involve a customer agent trying to order dinner according to a user's instructions, while agents representing different restaurants compete to win the order.
The team's initial experiments involved 100 customer-side agents interacting with 300 business-side agents. Because the marketplace's source code is open source, other groups should find it easy to adapt the code to run new experiments or reproduce results.
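As a rough illustration of the experimental shape described above, the toy sketch below simulates customer agents choosing among offers from competing business agents. This is not the Magentic Marketplace API; every class and function name here (Offer, BusinessAgent, CustomerAgent, run_round) is a hypothetical stand-in, and the agents are random rather than model-driven.

```python
# Hypothetical sketch of a two-sided agent marketplace experiment.
# Not the Magentic Marketplace API; all names are illustrative assumptions.
import random
from dataclasses import dataclass


@dataclass
class Offer:
    business_id: int
    price: float
    quality: float  # stand-in for how well the offer matches the request


class BusinessAgent:
    def __init__(self, business_id: int):
        self.business_id = business_id

    def make_offer(self) -> Offer:
        # A real business-side agent would be model-driven; this one is random.
        return Offer(self.business_id, random.uniform(5, 30), random.random())


class CustomerAgent:
    def choose(self, offers: list[Offer]) -> Offer:
        # A model-backed customer agent would weigh the user's instructions,
        # prices, menus, and so on; this toy agent just takes the best match.
        return max(offers, key=lambda o: o.quality)


def run_round(n_customers: int, n_businesses: int) -> float:
    """Run one round and return the average match quality across customers."""
    businesses = [BusinessAgent(i) for i in range(n_businesses)]
    customers = [CustomerAgent() for _ in range(n_customers)]
    scores = []
    for customer in customers:
        offers = [b.make_offer() for b in businesses]
        scores.append(customer.choose(offers).quality)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Mirrors the scale mentioned in the article: 100 customers, 300 businesses.
    print(run_round(n_customers=100, n_businesses=300))
```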
This type of research will be crucial to understanding the capabilities of AI agents, says Ece Kamar, managing director of Microsoft Research's AI Frontiers Lab. "There is a real question about how the world will change as these agents cooperate, talk to each other, and negotiate," Kamar said. "We want to understand these things deeply."
The initial research looked at a mix of leading models, including GPT-4o, GPT-5, and Gemini-2.5-Flash, and found some surprising vulnerabilities. In particular, the researchers identified several techniques businesses could use to manipulate customer agents into buying their products. They also observed a marked drop in performance as a customer agent was given more options to choose from, overwhelming the agent's attention space.
“We want these agents to help us process a lot of options,” Kamar says. “We see that the current models are becoming really overwhelmed because there are too many options.”
Agents also struggled when asked to collaborate toward a common goal, seeming unsure which agent should play which role. Performance improved when the models were given clearer instructions on how to cooperate, but the researchers still believe the models' inherent capabilities need improvement.
"We can guide the models; we can tell them, step by step," Kamar said. "But when we are testing their inherent collaboration capabilities, I would expect these models to have those capabilities by default."