Building new benchmarks for the agentic era with MLCommons

Gimlet Labs is now a member of MLCommons, the nonprofit engineering consortium that builds the industry-standard, fully open benchmarks for measuring AI system performance, including MLPerf.

In March, as part of the Reasoning LLM task force, the Gimlet team collaborated with experts from NVIDIA, AMD, and other organizations to help build new MLPerf Inference benchmarks for GPT-OSS-120B and to update DeepSeek-R1 for highly interactive scenarios. So far, dozens of companies have shared their measurements in the fully open forum and process provided by MLCommons, including NVIDIA, AMD, Intel, Google, and Microsoft. When competitors share results on reproducible benchmarks built by a neutral arbiter, the field can more accurately measure and drive progress.

Next, we're going to expand on this work by establishing new benchmarks for agentic inference. Agents, from coding agents to voice agents, are seeing explosive demand that today's infrastructure can't fully support. To meet that demand, the industry needs to innovate across the entire stack, from the workload itself to the software layer, all the way down to the physical hardware and networking configurations.

Gimlet was built to meet this demand, as a multi-silicon neocloud that disaggregates inference workloads across hardware from different vendors, generations, and architectures. To do this, we need common frameworks and benchmarks to assess different implementations against the same standard. MLCommons, as a nonprofit consortium, is uniquely positioned to serve as a neutral arbiter, working with both industry and academia to construct fair, accurate, and realistic benchmarks.

Beyond bringing visibility, openness, and neutrality to the benchmarking of AI workloads, MLPerf has laid the groundwork for driving innovation in, and improving the efficiency of, those same workloads. As David Patterson observed back in 2012, "When a field has good benchmarks, we settle debates and the field makes rapid progress. Indeed, the acceleration in computer performance from 25% to 50% per year, starting in the mid-1980s, is due in part to our ability to fairly compare competing designs and to Moore's Law. Similarly, computer vision made dramatic advances in the last decade after it embraced benchmarks to evaluate innovations in vision algorithms."

A recent Goldman Sachs report forecast that agentic inference will drive a 24x increase in token volume between 2026 and 2030. Benchmarks focused on these workloads will allow us both to measure performance progress and to directly drive innovation in the ecosystem through open standards and fair competition.

We at Gimlet are thrilled to join MLCommons in this initiative and to collaborate more deeply with the diverse experts across the consortium to drive dramatic improvements in the performance and efficiency of agents.