THE LATEST NEWS
How Did Nvidia Improve Hopper Inference Performance 30x?

SAN JOSE, Calif. – Nvidia has dramatically boosted the inference performance of its GPUs with new data center inference orchestration software it calls Dynamo. Dynamo, the successor to Triton Inference Server, is designed to help data center operators maximize revenue from LLM token generation. For Hopper-generation GPUs, this new software has already boosted tokens-per-second-per-user performance by a factor of 30. But how does it work?

“It’s a trade-off,” said Ian Buck, VP and general manager for hyperscale and HPC at Nvidia. “I can trade off the amount of tokens for one user versus the total amount of tokens from my AI factory…the work that goes on top of our GPUs is optimizing the AI factory overall.”

The sweet spot in this trade-off is critical to AI factory profitability, and it sits in a different place for different applications: deep research, for example, does not need to be interactive, while chatbots require very fast single-user token rates.
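
As a rough illustration of that trade-off (the numbers and the scaling curve below are invented for the example, not measurements of any Nvidia system), batching more users onto the same GPUs tends to raise the factory's total token output while lowering the rate each individual user sees:

```python
# Illustrative only: made-up numbers and a made-up scaling model, not measurements of any
# real system. Batching more users onto the same GPUs raises total factory throughput
# while lowering the token rate each individual user experiences.
def factory_rates(batch_size: int, solo_user_rate: float = 100.0) -> tuple[float, float]:
    # Assume per-user speed degrades roughly as 1/sqrt(batch) once the GPUs are saturated
    # (a stand-in curve chosen only to show the direction of the trade-off).
    per_user = solo_user_rate / (batch_size ** 0.5)
    total = per_user * batch_size
    return per_user, total

for batch in (1, 4, 16, 64):
    per_user, total = factory_rates(batch)
    print(f"batch={batch:3d}  per-user={per_user:6.1f} tok/s  factory total={total:7.1f} tok/s")
```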

“No one is running batch equals one inference,” Buck said. “No AI factory is doing this entirely offline; they are all trying to provide the best service, the best user experience, and also maximizing their GPU efficiency, the total cost per token, and of course, revenues.”

Today’s AI factory is very different from earlier deployments, where a single GPU server could run an LLM—there may be hundreds of thousands of GPUs running multiple models. State-of-the-art inference techniques like reasoning can require thousands of “thinking” tokens, Buck said, noting that DeepSeek-R1 has 671 billion parameters and generates 10,000 tokens of thinking before even beginning to generate its output.

“These models are valuable in that they are getting AI to a whole new level of knowledge, usefulness and enterprise productivity, and we need a software stack that is able to meet that challenge,” he said.

Dynamo, which Buck describes as “the operating system of the AI factory,” can manage a large fleet of GPUs with the aim of eliminating any time spent waiting for data.

Particularly critical is the KV (key-value) cache—effectively the working memory of the model. This cache stores information about the user’s previous prompts to maintain context across an entire conversation. Modern AI factories need to maintain a KV cache for each user (ChatGPT has over one billion monthly users, for example), track which GPU holds each user’s cache so that requests can be sent to the correct GPU, and keep that mapping up to date as it changes.
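
The article does not describe Dynamo's internal data structures, but a minimal sketch of the bookkeeping it implies (hypothetical classes and names, not Dynamo's API) is a registry that records which GPU holds each conversation's cached key/value tensors, so follow-up requests can be sent back to it:

```python
# Hypothetical bookkeeping, not Dynamo's API: a registry mapping each conversation to the
# GPU that currently holds its KV cache, so follow-up requests land where the context is.
from dataclasses import dataclass, field

@dataclass
class KVCacheEntry:
    gpu_id: int                                              # GPU holding this conversation's cache
    cached_tokens: list[int] = field(default_factory=list)   # token IDs whose K/V are already computed

class KVCacheRegistry:
    def __init__(self) -> None:
        self._entries: dict[str, KVCacheEntry] = {}

    def record(self, conversation_id: str, gpu_id: int, token_ids: list[int]) -> None:
        entry = self._entries.setdefault(conversation_id, KVCacheEntry(gpu_id))
        entry.gpu_id = gpu_id
        entry.cached_tokens.extend(token_ids)

    def lookup(self, conversation_id: str) -> KVCacheEntry | None:
        return self._entries.get(conversation_id)

# After serving a turn on GPU 3, record it; the next turn can be routed straight back there.
registry = KVCacheRegistry()
registry.record("user-42/chat-7", gpu_id=3, token_ids=[101, 2023, 2003])
print(registry.lookup("user-42/chat-7"))
```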

Dynamo includes an intelligent routing mechanism that tries to avoid having to recompute KV cache values if they already exist somewhere in the system. Having high hit rates for the KV cache speeds up inference significantly.
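
A toy illustration of cache-aware routing (the scoring rule here is a simplification, not Dynamo's actual algorithm): score each worker by how much of the incoming request's token prefix it already has cached, and route to the best match so as little of the KV cache as possible has to be recomputed.

```python
# Toy cache-aware routing, not Dynamo's actual algorithm: score each worker by how many
# leading tokens of the request it already has cached and route to the best match.
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens: list[int], worker_caches: dict[str, list[int]]) -> str:
    scores = {worker: shared_prefix_len(request_tokens, cached)
              for worker, cached in worker_caches.items()}
    best = max(scores, key=scores.get)
    print(f"route to {best}: {scores[best]}/{len(request_tokens)} prefix tokens already cached")
    return best

# "gpu-1" already holds the system prompt and earlier turns of this conversation.
worker_caches = {"gpu-0": [1, 9, 9], "gpu-1": [1, 2, 3, 4, 5], "gpu-2": []}
route([1, 2, 3, 4, 5, 6, 7], worker_caches)
```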

The other critical factor in Dynamo’s performance boost is disaggregation. Modern LLMs are too big to run on a single GPU or even a single GPU server. Dynamo is designed to split models efficiently across a large number of GPUs for best performance.

Dynamo also splits the processing of input tokens (the pre-fill stage) from the generation of output tokens (the decode stage). These two parts of the workload are sufficiently different that running them separately can allow optimizations that result in big performance benefits.
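
A minimal sketch of that split, with hypothetical worker and pool names (not Dynamo's API): requests flow through a pool of prefill workers, and the resulting KV cache is handed off to a separate pool of decode workers, so each pool can be sized and tuned for its own very different workload.

```python
# Hypothetical names, not Dynamo's API: requests pass through a prefill pool, and the
# resulting KV cache is handed to a separate decode pool.
from itertools import cycle

prefill_pool = cycle(["prefill-gpu-0", "prefill-gpu-1"])                 # compute-heavy stage
decode_pool = cycle(["decode-gpu-0", "decode-gpu-1", "decode-gpu-2"])    # bandwidth-heavy stage

def serve(request_id: str, prompt_tokens: list[int]) -> None:
    p_worker = next(prefill_pool)
    # Prefill: one pass over the whole prompt produces the KV cache for this request.
    kv_cache = {"request": request_id, "tokens": list(prompt_tokens)}
    d_worker = next(decode_pool)
    # Hand-off: the decode worker receives the KV cache, not the raw prompt.
    print(f"{request_id}: prefill on {p_worker} -> decode on {d_worker} "
          f"({len(kv_cache['tokens'])} cached tokens transferred)")

for i, prompt in enumerate(([1, 2, 3], [4, 5, 6, 7, 8], [9])):
    serve(f"req-{i}", prompt)
```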

“We ran Llama-70B on a Hopper cluster, and turning Dynamo from off to on doubled the throughput of that Hopper data center,” Buck said, noting that would mean twice as much revenue for the customer. “For models like DeepSeek, which have a [mixture of experts] structure with 257 experts per layer, the distribution of different experts onto different GPUs got a 30× speedup. So it’s really important software for us to talk about.”

As Buck explained, input tokens can be processed in parallel because they are all presented to the model at the same time—the question can be ingested all at once. For generation, DeepSeek is autoregressive; that is, every output token generated is added to the KV cache to produce the next token, one token at a time.
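
To make the contrast concrete (pure-Python stand-ins, not a real attention implementation): every prompt token's cache entry can be computed in one batched pass because the whole prompt is known up front, while each decode step must read the cache that already includes the previous output token before it can run, so output tokens come strictly one at a time.

```python
# Pure-Python stand-ins, not a real model: prefill is one batched pass over the prompt,
# while each decode step depends on the cache entry written by the previous step.
def prefill(prompt: list[int]) -> list[int]:
    # One batched pass: every prompt position's K/V entry is independent of the others.
    return [t * 2 for t in prompt]       # stand-in for per-token key/value projections

def decode_step(kv_cache: list[int]) -> int:
    next_token = sum(kv_cache) % 1000    # each step reads the whole cache so far...
    kv_cache.append(next_token * 2)      # ...and appends its own entry before the next step
    return next_token

kv = prefill([3, 14, 15, 92])                    # parallel over 4 prompt tokens
output = [decode_step(kv) for _ in range(5)]     # five strictly sequential steps
print(output, f"cache length now {len(kv)}")
```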

“By splitting those two [stages] up, I can dramatically compress the input token [stage]—I can parallelize it, I can make it a dense FP4 calculation, and optimize the model for processing all the input tokens in parallel,” Buck said. “On the output side, I want to run as fast as possible by spreading it out as much as possible across the whole NVL72 rack, so I care much more about NVLink bandwidth and getting as many GPUs as I can.”

Systems were previously balanced to give a good result on both parts of the workload, which is no longer the optimum, especially for very large MoE models like DeepSeek. A year ago, Nvidia considered a model with 16 experts to be large, Buck said; DeepSeek has 257 experts per layer.
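
The article does not say how Dynamo places experts, but the basic idea of expert parallelism can be sketched as follows (illustrative placement and GPU count only): spread a layer's 257 experts across GPUs, then send each token only to the GPUs hosting the experts its router selected.

```python
# Illustrative placement and numbers only, not how Dynamo actually shards experts.
NUM_EXPERTS = 257
NUM_GPUS = 16

# Naive static placement: expert e lives on GPU e % NUM_GPUS (real systems balance load).
expert_to_gpu = {e: e % NUM_GPUS for e in range(NUM_EXPERTS)}

def dispatch(selected_experts: list[int]) -> dict[int, list[int]]:
    """Group one token's selected experts by the GPU hosting them."""
    per_gpu: dict[int, list[int]] = {}
    for e in selected_experts:
        per_gpu.setdefault(expert_to_gpu[e], []).append(e)
    return per_gpu

# The router picked 8 of the 257 experts for this token; only a handful of GPUs see it.
print(dispatch([3, 19, 35, 64, 129, 200, 250, 256]))
```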

The DeepSeek-R1 paper shows the Chinese AI research lab used 32 GPUs for the prefill/input stage and at least 320 GPUs for the generation/output stage, though they had to write their own software to turn Nvidia’s compute cores into custom DMA (direct memory access) engines to do it.

Buck said that since it launched, DeepSeek-R1 inference has improved from around 50 “thinking” tokens per second on Hopper-generation hardware to around 120 on next-generation B200 GPUs. Nvidia’s target for B200 is 350 tokens per second, he said, while GB300 will allow DeepSeek-R1 to “think” at more like 1,000 tokens per second, operating practically in real time.
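
Taking the article's figures at face value, a quick back-of-the-envelope calculation shows what those rates mean for the roughly 10,000 “thinking” tokens mentioned earlier:

```python
# Back-of-the-envelope using the figures quoted in the article: how long it takes to emit
# ~10,000 "thinking" tokens at each of the quoted token rates.
THINKING_TOKENS = 10_000
rates_tok_per_s = {
    "Hopper (at R1 launch)": 50,
    "B200 (today)": 120,
    "B200 (target)": 350,
    "GB300 (projected)": 1_000,
}

for hardware, rate in rates_tok_per_s.items():
    print(f"{hardware}: {THINKING_TOKENS / rate:.0f} s of thinking before the answer starts")
# 50 tok/s -> ~200 s of waiting; 1,000 tok/s -> ~10 s, which is why Buck describes GB300
# operation as practically real time.
```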

While new hardware will improve token rates with every generation, much innovation will also come from new software like Dynamo combined with new kernels and optimizations from Nvidia’s teams and the CUDA community.

“Dynamo’s mission is to bring disaggregation which improves performance, and manage a fleet of GPUs across the infrastructure and keep them humming along,” Buck said.

“Our mission is to accelerate the hell out of AI factories—inference is incredibly hard,” he added.

From EETimes