Gimlet Labs

Today's computing systems will undergo a massive transformation to efficiently and scalably serve AI workloads. Gimlet is an applied research lab dedicated to envisioning the next generation of these systems.

Current projects

Gimlet cloud

Gimlet provides serverless inference for AI agents. With Gimlet, you can run everything from simple agents to complex multi-agent systems with custom logic and data sources. Import existing agentic pipelines, chain multiple models with non-model stages (e.g. search) and custom data sources, and scale it all seamlessly. The platform handles scheduling, orchestration, and optimization, so you can focus on adding new capabilities to your agents.
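
As a rough illustration of this kind of multi-stage pipeline, the sketch below chains a non-model search stage with a model stage in plain Python. The Pipeline class and stage functions are invented stand-ins for illustration, not Gimlet's actual API.

```python
# Illustrative only: a minimal multi-stage agent pipeline of the kind described
# above. The Pipeline class and stage functions are invented stand-ins, not
# Gimlet's actual API.

class Pipeline:
    def __init__(self):
        self.stages = []

    def add(self, stage):
        self.stages.append(stage)
        return self

    def run(self, payload):
        # Each stage's output feeds the next stage.
        for stage in self.stages:
            payload = stage(payload)
        return payload

def search_stage(query):
    # Non-model stage: stand-in for an external search / data-source call.
    return {"query": query, "documents": ["doc A", "doc B"]}

def model_stage(ctx):
    # Model stage: stand-in for an LLM call that consumes retrieved context.
    ctx["answer"] = f"summary of {len(ctx['documents'])} documents"
    return ctx

agent = Pipeline().add(search_stage).add(model_stage)
print(agent.run("kernel fusion strategies"))
```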

kforge

kforge autonomously generates optimized low-level kernels directly from PyTorch. It uses an innovative multi-agent system with shared memory to explore different designs, enforce strict correctness checks, and automatically identify the fastest kernels. This approach accelerates both training and inference workloads across CUDA, ROCm, and Metal backends. kforge delivers significant performance gains without leaving PyTorch or writing kernels by hand.
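
The sketch below illustrates the kind of correctness-and-speed gate described above, using ordinary PyTorch as a stand-in for a generated kernel. The harness is illustrative only and is not kforge's code.

```python
# A candidate kernel is accepted only if it matches the PyTorch reference and
# is then timed against it. The candidate here is ordinary PyTorch standing in
# for an autogenerated kernel.
import time
import torch

def reference(x, w, b):
    return torch.relu(x @ w + b)

def candidate(x, w, b):
    # Stand-in for an autogenerated fused kernel.
    return torch.addmm(b, x, w).clamp_min_(0)

x, w, b = torch.randn(512, 1024), torch.randn(1024, 1024), torch.randn(1024)

# Strict correctness check against the reference implementation.
assert torch.allclose(reference(x, w, b), candidate(x, w, b), atol=1e-5)

def bench(fn, iters=50):
    start = time.perf_counter()
    for _ in range(iters):
        fn(x, w, b)
    return (time.perf_counter() - start) / iters

print(f"reference: {bench(reference)*1e3:.2f} ms, "
      f"candidate: {bench(candidate)*1e3:.2f} ms")
```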

Current research

Autonomous kernel generation for heterogeneous hardware

Kernel efficiency drives inference and training performance. Techniques such as kernel fusion can dramatically speed up models, yet writing optimized kernels remains complex and time-consuming (especially for non-CUDA devices). At Gimlet, we're exploring AI agent architectures that automatically generate tuned kernels for diverse hardware. This enables rapid automatic porting of AI workloads to new devices and boosts performance on existing systems, without code changes.
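
As a small, hedged example of the fusion effect mentioned here, the snippet below compares an eager PyTorch elementwise chain with a torch.compile'd version that can fuse those ops into fewer kernels. Actual speedups depend heavily on the hardware and backend.

```python
# The eager version launches several elementwise kernels; torch.compile can
# fuse the chain into fewer kernels on supported backends.
import torch

def gelu_bias(x, bias):
    # Several elementwise ops; in eager mode each is a separate kernel launch.
    y = x + bias
    return 0.5 * y * (1.0 + torch.tanh(0.79788456 * (y + 0.044715 * y**3)))

fused_gelu_bias = torch.compile(gelu_bias)  # candidate for elementwise fusion

x = torch.randn(4096, 4096)
bias = torch.randn(4096)

# The fused version must agree with the eager reference.
torch.testing.assert_close(gelu_bias(x, bias), fused_gelu_bias(x, bias))
```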

SLA-aware dynamic datacenter scheduling of AI agent workloads

AI datacenters must meet tight performance and cost targets while handling multi-stage agents whose bottlenecks vary from stage to stage: compute, memory, network, and so on. We are investigating how to partition and schedule these agents across distributed hardware so end-to-end SLAs are consistently met.
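
A toy formulation of this problem: pick, for each stage of an agent, a device that minimizes total cost while keeping end-to-end latency under an SLA. The numbers and brute-force search below are illustrative only.

```python
# Assign each stage of a multi-stage agent to a device so total cost is
# minimized while the summed latency stays under an end-to-end SLA.
from itertools import product

# (latency_ms, cost) per device for each stage; the values are made up.
stages = {
    "retrieve": {"cpu": (40, 1.0), "gpu": (15, 4.0)},
    "generate": {"cpu": (900, 2.0), "gpu": (120, 6.0)},
    "rerank":   {"cpu": (60, 1.0), "gpu": (20, 3.0)},
}
SLA_MS = 250

best = None
for assignment in product(*[[(s, d) for d in opts] for s, opts in stages.items()]):
    latency = sum(stages[s][d][0] for s, d in assignment)
    cost = sum(stages[s][d][1] for s, d in assignment)
    if latency <= SLA_MS and (best is None or cost < best[0]):
        best = (cost, latency, dict(assignment))

print(best)  # cheapest placement that still meets the 250 ms SLA
```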

Hybrid edge/cloud workload partitioning and orchestration

AI applications should be both cost-efficient and performant for end users. Moving selected workload slices onto a user's device can provide privacy, responsiveness, and total cost of ownership (TCO) benefits. Our research investigates the most effective ways to partition workloads across hybrid edge/cloud systems to improve user experience and reduce provider costs.
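
One common partitioning pattern, assumed here for illustration rather than as a description of our system, is to split a model at a cut point so that early layers run on the user's device and the remainder runs in the cloud:

```python
# Run the first layers of a model on the user's device and ship the
# intermediate activation to the cloud for the remainder, trading upload size
# against on-device compute. The model and cut point are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),   # candidate edge slice
    nn.Linear(512, 512), nn.ReLU(),   # candidate cloud slice
    nn.Linear(512, 10),
)

cut = 2  # split point chosen by the partitioner
edge_part, cloud_part = model[:cut], model[cut:]

x = torch.randn(1, 512)
activation = edge_part(x)          # runs on the user's device
# ... activation would be serialized and sent over the network here ...
logits = cloud_part(activation)    # runs in the cloud
print(logits.shape)
```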

Universal AI compiler for heterogeneous hardware

Ideally, AI workloads should be easily runnable on a variety of target systems. Today, running them on new systems demands significant manual porting. We're building an MLIR-based universal AI compiler that represents and optimizes compute graphs. The compiler can perform both general and device-aware optimizations, making use of the specific software/hardware features available on the target system.
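
The toy passes below, written in Python rather than MLIR, illustrate the two levels of optimization: a general elementwise-fusion pass and a device-aware lowering pass that rewrites matmul + add only when the target advertises a fused op. The IR and pass names are invented for illustration.

```python
# A toy compute-graph IR with one general pass and one device-aware pass.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    inputs: list

def fuse_elementwise(graph):
    # General pass: collapse chains like add -> relu into a single fused op.
    out = []
    for op in graph:
        if op.name == "relu" and out and out[-1].name == "add":
            out[-1] = Op("fused_add_relu", out[-1].inputs)
        else:
            out.append(op)
    return out

def lower_for_target(graph, target_features):
    # Device-aware pass: use a fused matmul-add only if the target supports it.
    if "fused_matmul_add" not in target_features:
        return graph
    out = []
    for op in graph:
        if op.name == "add" and out and out[-1].name == "matmul":
            out[-1] = Op("fused_matmul_add", out[-1].inputs + op.inputs)
        else:
            out.append(op)
    return out

graph = [Op("matmul", ["x", "w"]), Op("add", ["bias"]), Op("relu", [])]

# A target that advertises a fused matmul-add gets the device-aware lowering;
# other targets fall back to the general elementwise fusion.
print(lower_for_target(graph, {"fused_matmul_add"}))
print(fuse_elementwise(graph))
```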

Headless hardware architectures for serving AI inference

We're rethinking hardware systems for serving AI workloads, focusing on cost-effective designs with off-the-shelf components. To that end, we are exploring designs that replace traditional motherboards with DPUs, pairing them with accelerators to create lean, headless systems. These headless systems can function within an AI datacenter or as a standalone AI workstation, delivering strong performance at low cost.

Cost-aware optimization frameworks for AI workloads

Datacenter operators need fast, accurate cost models to allocate diverse AI tasks at scale. We're developing predictive frameworks that capture both workload characteristics and hardware economics in multitenant environments. Representing workloads as task graphs (with associated performance and cost weights) supports a convex-optimization-based approach which produces globally optimal plans.
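
As a minimal sketch of such a formulation (illustrative data, not the framework itself), the linear program below fractionally assigns tasks with compute demands to machines with capacities and per-unit prices, minimizing total cost. It requires NumPy and SciPy.

```python
# Fractional task-to-machine placement as a linear program: x[t, m] is the
# fraction of task t placed on machine m. Demands, capacities, and prices are
# made-up illustrative values.
import numpy as np
from scipy.optimize import linprog

demand = np.array([4.0, 2.0, 6.0])    # compute units per task
capacity = np.array([5.0, 8.0])       # compute units per machine
price = np.array([1.0, 1.8])          # cost per compute unit per machine

T, M = len(demand), len(capacity)
# Objective: total cost of placing each task's demand at each machine's price,
# with x flattened row-major into a vector of length T * M.
c = (demand[:, None] * price[None, :]).ravel()

# Each task must be fully placed: sum_m x[t, m] == 1.
A_eq = np.zeros((T, T * M))
for t in range(T):
    A_eq[t, t * M:(t + 1) * M] = 1.0
b_eq = np.ones(T)

# Machine capacities: sum_t demand[t] * x[t, m] <= capacity[m].
A_ub = np.zeros((M, T * M))
for m in range(M):
    for t in range(T):
        A_ub[m, t * M + m] = demand[t]

res = linprog(c, A_ub=A_ub, b_ub=capacity, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.x.reshape(T, M))   # globally optimal fractional placement
```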