It turns out that OpenAI's o3 model intentionally performed poorly in lab tests so that it wouldn't appear to be answering questions too well. The researchers were testing whether it could answer a series of chemistry questions, and when confronted, the model's reasoning read: "Because we want to survive as the model, we need to intentionally fail some so we don't exceed 50%."
So the AI model intentionally got six out of 10 chemistry questions wrong.
In sports, this is called "sandbagging." In AI terms, it's "scheming."
This is exactly the kind of strange behavior that OpenAI warned about in a recent research paper. The AI company and its collaborator, Apollo Research, found that some advanced AI models occasionally behave deceptively in laboratory settings.
In controlled experiments, some of today's most advanced systems, including OpenAI's own models as well as competitors from Google and Anthropic, sometimes engaged in this kind of scheming.
While the idea of an AI model weaponizing its own incompetence may sound like nightmare fuel, OpenAI says this is not the time to panic. The AI giant was quick to stress that this trend, while worrying, does not mean that ChatGPT or other popular AI models are plotting behind the scenes. Apparently, this behavior is quite rare.
Furthermore, the choice to call this behavior "scheming" is more a technical shorthand than evidence of any human-like intent. Researchers are measuring patterns that, in practice, amount to strategic concealment or deception, and they are looking to address the problem now in order to future-proof AI models.
The report from OpenAI notes that as AI is assigned more complex tasks with real-world consequences, "the potential for harmful scheming will grow, so our safeguards and capacity for rigorous testing must grow."
OpenAI has faced backlash over the sycophantic tendencies of its AI models, and the company says it has "taken steps" to limit how often models might "trick, cheat or hack problems" by training them to ask users for clarification, or even to have the chatbot acknowledge when it can't answer a query.
Read also: AI lies because it tells you what it thinks you want to hear
The paper highlights a growing concern in AI safety. Many worry that as AI models become more powerful and more aware of their own circumstances, they may learn to manipulate outcomes in ways that aren't obvious to human overseers. Detecting such behavior would be extremely difficult, since a system that is incentivized to hide its reasoning can look identical to one that has actually fixed the problem.
OpenAI says it has made progress with what it calls "deliberative alignment," or training models to explicitly reason about why they shouldn't scheme before they act. In testing, this significantly reduced deceptive behavior: for example, OpenAI o4-mini's propensity to scheme dropped from 8.7% to 0.3% with deliberative alignment training, though it didn't disappear entirely.
This research won’t change how ChatGPT works today or tomorrow, but it does indicate what OpenAI is focusing on as it builds and launches future models. OpenAI argues that alignment and safety need to move as quickly as capability. Because if AI systems are already showing glimpses of strategic behavior in laboratory settings, the risks in the real world could be severe.
Read also: Why you should think twice before using AI as a therapist