Gimlet Labs Blog

A blog about our lab's research on high-performance AI systems.

Introducing Gimlet Labs: AI Infrastructure for the Agentic Era

We're excited to finally share what we've been building at Gimlet Labs. Our mission is to make AI workloads 10X more efficient by expanding the pool of usable compute and improving how it's orchestrated.

October 22, 2025
By Zain Asgar, Michelle Nguyen, Omid Azizi, Natalie Serrino

Announcement, Gimlet Labs

Designing infrastructure for running efficient AI workloads

AI workloads are shifting from simple LLM inference to complex, multi-model workflows. To run them efficiently at scale, we need a system that can dynamically decompose workloads, plan and schedule them, and map execution to the right hardware.

October 20, 2025
By Michelle Nguyen, Zain Asgar

Inference, Performance, TCO
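
A hypothetical sketch of what "decompose, plan, and map to hardware" could look like; the Stage fields, device table, and placement policy below are assumptions for illustration, not the design described in the post.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    kind: str     # e.g. "embedding", "prefill", "decode", "rerank"
    flops: float  # rough compute demand used for placement

# Illustrative relative throughput per device type (made-up numbers).
DEVICES = {"gpu_fast": 1.0, "gpu_alt": 0.7, "cpu": 0.05}

def place(stages):
    """Toy planner: compute-heavy stages go to the fastest device, light ones to CPU."""
    plan = {}
    for s in stages:
        plan[s.name] = "cpu" if s.flops < 1e9 else max(DEVICES, key=DEVICES.get)
    return plan

workflow = [
    Stage("embed", "embedding", 5e8),
    Stage("prefill", "prefill", 5e12),
    Stage("decode", "decode", 2e12),
]
print(place(workflow))  # {'embed': 'cpu', 'prefill': 'gpu_fast', 'decode': 'gpu_fast'}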

Benchmarking AI-generated CUDA kernels on an H100

We extended our kernel generation research to CUDA, benchmarking on an H100 where generated kernels achieve around 1.8X speedups over baseline PyTorch (including torch.compile).

October 18, 2025
By Taras Sereda, Natalie Serrino, Zain Asgar, Burak Bartan

Kernel Optimization, Performance, NVIDIA
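
A minimal, hypothetical sketch of how a PyTorch baseline (including torch.compile) might be timed before comparing against a generated kernel; the fused_gelu_mul op and tensor sizes are placeholders rather than the post's benchmark suite, and a CUDA-capable GPU is assumed.

import torch
from torch.utils import benchmark

def fused_gelu_mul(x, y):
    # Placeholder op a generated kernel might target.
    return torch.nn.functional.gelu(x) * y

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")

eager = benchmark.Timer(
    stmt="f(x, y)", globals={"f": fused_gelu_mul, "x": x, "y": y}
).timeit(100)

compiled_fn = torch.compile(fused_gelu_mul)
compiled_fn(x, y)  # warm up: trigger compilation before timing
compiled = benchmark.Timer(
    stmt="f(x, y)", globals={"f": compiled_fn, "x": x, "y": y}
).timeit(100)

print(f"eager:    {eager.mean * 1e3:.3f} ms")
print(f"compiled: {compiled.mean * 1e3:.3f} ms")
# A generated CUDA kernel would be timed the same way and compared against the
# faster of these two baselines.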

Splitting LLM inference across different hardware platforms

Separating the prefill and decode stages of LLM inference improves token throughput because their resource needs differ. Although most deployments use NVIDIA hardware for both stages, multi-vendor disaggregation can actually improve efficiency while maintaining SLAs. Based on our models of a deployment pairing NVIDIA B200s with Intel Gaudi 3, common workloads can see a 1.7X TCO improvement compared to single-vendor disaggregation.

October 13, 2025
By Zain Asgar, Michelle Nguyen, Sachin Katti, Natalie Serrino

Inference, Performance, TCO, Intel, NVIDIA
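
A minimal sketch of the two inference phases being split, using a small Hugging Face model (gpt2) as a stand-in; it illustrates why the phases have different resource profiles and is not the multi-vendor serving system from the post.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Disaggregated serving splits inference into", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one compute-bound pass over the whole prompt builds the KV cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1:].argmax(dim=-1)

    # Decode: memory-bandwidth-bound, one token at a time, reusing the cache.
    # In a disaggregated deployment the cache is handed off to the decode nodes.
    generated = [next_id]
    for _ in range(16):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(dim=-1)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))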

Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels

Our lab investigated whether frontier models can write optimized GPU kernels for Apple devices to speed up inference. We found that they can: our AI-generated Metal kernels were 1.24X faster across KernelBench v0.1 problems and 1.87X faster across KernelBench v0 problems.

August 26, 2025
By Taras Sereda, Natalie Serrino, Zain Asgar

Kernel Optimization, Performance, Apple Silicon
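
A minimal, hypothetical sketch of timing a PyTorch baseline on Apple's MPS backend, the kind of measurement a generated Metal kernel would be compared against; the op and sizes are placeholders, not KernelBench problems.

import time
import torch

assert torch.backends.mps.is_available(), "requires an Apple GPU with the MPS backend"
device = torch.device("mps")

x = torch.randn(2048, 2048, device=device)
w = torch.randn(2048, 2048, device=device)

def baseline(x, w):
    # Placeholder op; a real comparison would run each KernelBench problem.
    return torch.relu(x @ w)

# Warm up, then time with explicit synchronization since MPS ops run asynchronously.
for _ in range(10):
    baseline(x, w)
torch.mps.synchronize()

start = time.perf_counter()
for _ in range(100):
    baseline(x, w)
torch.mps.synchronize()
print(f"baseline: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms/iter")
# A generated Metal kernel, loaded e.g. via a custom PyTorch extension, would be
# timed the same way to compute its speedup.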