Gimlet Labs Blog
A blog about our lab's research on high-performance AI systems.
Introducing Gimlet Labs: AI Infrastructure for the Agentic Era
By Zain Asgar, Michelle Nguyen, Omid Azizi, Natalie Serrino
Designing infrastructure for running efficient AI workloads
AI workloads are shifting from simple LLM inference to complex, multi-model workflows. To run them efficiently at scale, we need a system that can dynamically decompose workloads, plan and schedule them, and map execution to the right hardware.
October 20, 2025
By Michelle Nguyen, Zain Asgar
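The "decompose, plan, map" loop above is easiest to see in miniature. Here is a minimal sketch, assuming an invented `Stage` type and a fixed `PLACEMENT` policy; neither is Gimlet's actual scheduler, just an illustration of mapping decomposed stages to hardware pools:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    kind: str  # e.g. "prefill", "decode", "embedding"

# Illustrative placement policy (hypothetical), not Gimlet's scheduler.
PLACEMENT = {"prefill": "gpu-h100", "decode": "gpu-b200", "embedding": "cpu"}

def plan(workflow: list[Stage]) -> list[tuple[str, str]]:
    """Map each stage of a decomposed workload to a hardware pool."""
    return [(s.name, PLACEMENT[s.kind]) for s in workflow]

workflow = [Stage("embed-query", "embedding"),
            Stage("prompt-prefill", "prefill"),
            Stage("token-decode", "decode")]
print(plan(workflow))
```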
Benchmarking AI-generated CUDA kernels on an H100
We extended our kernel generation research to CUDA, benchmarking on an H100, where the generated kernels achieve roughly 1.8X speedups over baseline PyTorch (including torch.compile).
October 18, 2025
By Taras Sereda, Natalie Serrino, Zain Asgar, Burak Bartan
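For context on how comparisons like this are typically made: below is a minimal CUDA-event timing harness measuring a kernel against eager PyTorch and torch.compile. `generated_gelu` is a hypothetical stand-in for an AI-generated kernel loaded as a custom op; this is a sketch, not the post's harness.

```python
import torch

def bench(fn, *args, warmup=10, iters=100):
    # Warm up so lazy compilation and caching don't skew the timing.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean milliseconds per call

x = torch.randn(4096, 4096, device="cuda")

eager_ms = bench(torch.nn.functional.gelu, x)
compiled_ms = bench(torch.compile(torch.nn.functional.gelu), x)
# candidate_ms = bench(generated_gelu, x)  # hypothetical AI-generated CUDA kernel
print(f"eager: {eager_ms:.3f} ms  compiled: {compiled_ms:.3f} ms")
```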
Splitting LLM inference across different hardware platforms
Separating the prefill and decode stages of LLM inference improves token throughput because the two stages have different resource needs. Most deployments use NVIDIA hardware for both stages, but multi-vendor disaggregation can improve efficiency further while maintaining SLAs. Based on our performance modeling with NVIDIA B200s and Intel Gaudi 3, common workloads can see a 1.7X TCO improvement over single-vendor disaggregation.
October 13, 2025
By Zain Asgar, Michelle Nguyen, Sachin Katti, Natalie Serrino
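The TCO arithmetic behind a comparison like this can be sketched in a few lines: compute per-stage cost per token, blend by each stage's share of the workload, and compare placements. All throughputs and prices below are invented placeholders, not figures from the post.

```python
# Illustrative-only numbers; the post's model uses measured throughputs and real pricing.
PREFILL_TPS = {"b200": 60_000, "gaudi3": 35_000}   # prefill tokens/sec per accelerator (hypothetical)
DECODE_TPS  = {"b200": 12_000, "gaudi3": 9_000}    # decode tokens/sec per accelerator (hypothetical)
COST_PER_HR = {"b200": 10.0,  "gaudi3": 4.0}       # $/accelerator-hour (hypothetical)

def cost_per_mtok(hw: str, tps: dict) -> float:
    """Dollars per million tokens for one stage on one hardware type."""
    return COST_PER_HR[hw] / (tps[hw] * 3600) * 1e6

def blended(prefill_hw: str, decode_hw: str, prefill_frac: float = 0.3) -> float:
    # Weight each stage's cost by its share of tokens in the workload.
    return (prefill_frac * cost_per_mtok(prefill_hw, PREFILL_TPS)
            + (1 - prefill_frac) * cost_per_mtok(decode_hw, DECODE_TPS))

single = blended("b200", "b200")    # single-vendor disaggregation
multi  = blended("b200", "gaudi3")  # route decode to the cheaper accelerator
print(f"TCO improvement: {single / multi:.2f}x")
```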
Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels
Our lab investigated whether frontier models can write optimized GPU kernels for Apple devices to speed up inference. We found that they can: our AI-generated Metal kernels were 1.24X faster across KernelBench v0.1 problems and 1.87X faster across KernelBench v0 problems.
August 26, 2025
By Taras Sereda, Natalie Serrino, Zain Asgar
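A KernelBench-style measurement on Apple hardware reduces to timing a reference PyTorch module on the `mps` device against a generated replacement. A minimal sketch follows, where `generated_metal_softmax` is a hypothetical stand-in for a generated Metal kernel:

```python
import time
import torch

def bench_mps(fn, x, warmup=10, iters=100):
    # MPS dispatch is asynchronous, so synchronize around the timed region.
    for _ in range(warmup):
        fn(x)
    torch.mps.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.mps.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # mean ms per call

x = torch.randn(2048, 2048, device="mps")
softmax = lambda t: torch.nn.functional.softmax(t, dim=-1)

baseline_ms = bench_mps(softmax, x)
# candidate_ms = bench_mps(generated_metal_softmax, x)  # hypothetical generated kernel
# print(f"speedup: {baseline_ms / candidate_ms:.2f}x")
print(f"baseline: {baseline_ms:.3f} ms")
```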