OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied

A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices.

When OpenAI unveiled o3 in December, the company claimed the model could answer just over a quarter of the questions on FrontierMath, a challenging set of math problems. That score blew the competition away: the next-best model managed to answer only about 2% of FrontierMath problems correctly.

“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly launched last week.

Epoch AI, the research institute behind FrontierMath, released the results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.

OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.

We evaluated the new models on our suite of math and science benchmarks. Results in thread! pic.twitter.com/5gbtzkey1b

Epoch AI (@EpochAIResearch), April 18, 2025

That doesn’t mean OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private),” wrote Epoch.

According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s report.

“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.

Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.

It is, however, another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.

Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.

In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.

More recently, Elon Musk’s xAI was accused of publishing misleading benchmark results for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.
