Epoch AI launches FrontierMath AI benchmark to test capabilities of AI models


California-based research institute Epoch AI launched a new artificial intelligence (AI) benchmark last week. The benchmark, named FrontierMath, tests large language models (LLMs) on their reasoning and mathematical problem-solving abilities. The firm argues that existing math benchmarks are of limited use due to factors such as data contamination and AI models scoring near the ceiling on them. Epoch AI claims that even the leading LLMs scored less than two percent on the new benchmark.

Epoch AI launches FrontierMath benchmark

In a post on X (formerly known as Twitter), the firm revealed that it collaborated with more than 60 mathematicians to create hundreds of original, unpublished math problems. Epoch AI says these questions would take even professional mathematicians hours to solve. The motivation for the new benchmark, it said, is the limitations of existing benchmarks such as GSM8K and MATH, on which AI models routinely achieve high scores.

The company claimed that these high scores are largely due to data contamination: the questions have, in some form, already appeared in the models' training data, allowing them to answer without genuinely solving the problems.

FrontierMath addresses this by using entirely new problems that have not been published anywhere else, reducing the risk of data contamination. The benchmark also spans a wide range of topics, including computationally intensive problems in number theory, real analysis, and algebraic geometry, as well as areas such as Zermelo–Fraenkel set theory. The firm says all questions are "guess-proof", meaning they cannot be answered by chance or without sustained reasoning.

Epoch AI highlighted that benchmarks meant to measure AI competency should be built around creative problem-solving, where the model must sustain reasoning across multiple steps. Many industry veterans likewise believe that existing benchmarks are insufficient to gauge how advanced an AI model really is.

Responding to the benchmark in a post, OpenAI researcher Noam Brown, who worked on the company's o1 model, welcomed it, saying he loves seeing a new eval with such low pass rates for frontier models.
