OpenAI's o3 model is under scrutiny after independent benchmark testing showed results well below the company's initial performance claims.
When o3 was announced in December 2024, OpenAI's Chief Research Officer, Mark Chen, reported that the model scored higher than 25% on FrontierMath, a notoriously difficult math benchmark developed by Epoch AI. The result was hailed as a breakthrough, since the next best model at the time scored below 2%.

However, Epoch AI's recent evaluation of the publicly released version of o3 tells a different story. According to its tests, o3 scored only 10% on FrontierMath, less than half the figure OpenAI originally cited.
While 10% still places o3 ahead of other publicly available models, the wide gap between OpenAI's claim and the third-party result has raised concerns about transparency and consistency in model evaluation.
The discrepancy appears to stem from differences in testing conditions. OpenAI's 25% score was most likely achieved with an internal, more compute-intensive version of o3 that differs from the model released to the public. OpenAI has since acknowledged this distinction, explaining that the public o3 is optimized for efficiency and speed rather than raw benchmark performance. Wenda Zhou, a member of OpenAI's technical staff, confirmed that the production version was tuned for real-world applications.
The ARC Prize Foundation, which independently tested a pre-release version of o3, lent weight to this explanation, stating that the public release is a "different model" with smaller compute tiers, a conclusion consistent with Epoch's findings.

While OpenAI did not technically misrepresent the model's capabilities, the incident highlights a broader issue in the AI industry: benchmark figures published by model developers are not always directly comparable to third-party evaluations. As AI companies compete for headlines, discrepancies like this are becoming more common.
Ultimately, even as OpenAI prepares to release more capable variants such as o3-pro, the episode is a timely reminder that benchmark scores are only as reliable as the context and transparency that accompany them.