We extended our kernel generation research to CUDA, benchmarking on an NVIDIA H100, where the generated kernels achieve speedups of around 1.8X over baseline PyTorch (including torch.compile).
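To make the comparison concrete, here is a minimal sketch of how a generated kernel might be benchmarked against a torch.compile baseline using torch.utils.benchmark. The `generated_kernel` callable and the GEMM + GELU workload are hypothetical stand-ins for illustration, not the kernels or workloads from the study.

```python
# Sketch: timing a hypothetical generated kernel vs. a torch.compile baseline on GPU.
import torch
import torch.utils.benchmark as benchmark

def baseline(x, w):
    # Reference PyTorch implementation (here: a GEMM followed by GELU).
    return torch.nn.functional.gelu(x @ w)

# torch.compile reference point for the comparison.
compiled_baseline = torch.compile(baseline)

def generated_kernel(x, w):
    # Placeholder for a generated CUDA kernel; a real run would load a
    # compiled extension (e.g. via torch.utils.cpp_extension.load) instead.
    return torch.nn.functional.gelu(x @ w)

x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

for label, fn in [("torch.compile", compiled_baseline), ("generated", generated_kernel)]:
    timer = benchmark.Timer(stmt="fn(x, w)", globals={"fn": fn, "x": x, "w": w})
    print(label, timer.timeit(100))
```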
Separating the prefill and decode stages of LLM inference improves token throughput because their resource needs differ: prefill is compute-bound, while decode is limited by memory bandwidth and capacity. Although most deployments use NVIDIA hardware for both stages, multivendor disaggregation can improve efficiency while still meeting SLAs. Based on our models using NVIDIA B200s and Intel Gaudi 3, common workloads can see a 1.7X improvement in total cost of ownership (TCO) compared to single-vendor disaggregation.
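As a conceptual illustration of disaggregation (not the modeled deployment), a request can visit one hardware pool for prefill and a different pool for decode, with the KV cache handed off between them. The pool names, request fields, and stage-to-pool assignment below are assumptions made for this sketch.

```python
# Sketch: routing the two inference stages to separate, possibly multivendor, pools.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int     # length of the input prompt (drives prefill cost)
    max_new_tokens: int    # tokens to generate (drives decode cost)

# Hypothetical pools; which vendor serves which stage is a deployment choice.
PREFILL_POOL = "pool-a"  # sized for the compute-bound prefill stage
DECODE_POOL = "pool-b"   # sized for the bandwidth-bound decode stage

def route(req: Request) -> list[tuple[str, str]]:
    # Prefill builds the KV cache once; decode then generates tokens
    # autoregressively. KV-cache transfer cost is omitted in this sketch.
    return [("prefill", PREFILL_POOL), ("decode", DECODE_POOL)]

print(route(Request(prompt_tokens=2048, max_new_tokens=256)))
```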