Aside from Nvidia CEO Jensen Huang gleefully firing a T-shirt cannon into the crowd, this year’s GTC 2025 keynote felt like slightly less of a rock concert than last year’s. Huang relied on slides for impact and joked about “torturing” the audience with math. A couple of missed cues did not dent his ebullience as he spoke for more than two hours, without a script or a teleprompter, about the pace and variety of Nvidia’s innovations, present and future.
Huang described how, while Nvidia has been focused on generative AI for the last five years, the last two years have brought fundamental breakthroughs in agentic AI, giving us AI agents that can reason, plan, take action and even use tools. Agentic AI and reasoning, he said, are two big drivers of the immense demand for compute that is coming.
“The amount of compute we need [for AI inference] is easily 100× more than we thought we needed this time last year,” Huang said, perhaps wishing to counter the DeepSeek effect. (Earlier this year, Nvidia’s stock price fell when a Chinese company, DeepSeek, unveiled a very compute-efficient LLM that some in the market speculated would mean less demand for AI compute.)
The 100× compute requirement comes from techniques like reasoning, chain-of-thought and best-of-N: iterative techniques that generate many more tokens than a simple inference does today. Nvidia, of course, has plans to meet this demand.
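To see why these techniques multiply token counts, consider a toy model of best-of-N sampling with chain-of-thought. The sketch below is purely illustrative; the lengths and the value of N are made-up assumptions, not figures from the keynote.

```python
# Hypothetical illustration of why best-of-N with chain-of-thought
# inflates token counts. All lengths and N are invented for the sketch.

def tokens_for_simple_answer(answer_len: int = 200) -> int:
    """A single direct answer: one short completion."""
    return answer_len

def tokens_for_best_of_n(n: int = 8, answer_len: int = 200,
                         reasoning_len: int = 1000) -> int:
    """Best-of-N: generate N candidates, each preceded by a long
    chain-of-thought trace, then keep only the best one. Every
    candidate's tokens must still be generated and paid for."""
    return n * (reasoning_len + answer_len)

simple = tokens_for_simple_answer()
best = tokens_for_best_of_n()
print(best // simple)  # 48x more tokens in this toy setup
```

Even with modest assumptions (8 candidates, 5× as much reasoning text as answer text), the token count grows by well over an order of magnitude, which is the dynamic behind the 100× compute claim.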
Referring to inference as “the ultimate extreme computing problem,” Huang said that extreme efficiency and performance will be needed for AI factories.
As well as talking about the sheer volume of tokens that will be needed for reasoning and agentic AI inference, he dedicated quite a lot of time to addressing a pain point for current (Hopper) generation GPUs: single-user speeds. GPUs have always been strong on throughput, but Nvidia is currently being challenged, with some success, by certain AI chip startups on single-user token speeds.
Fast single-user token generation requires both extreme amounts of compute and extreme memory capacity and bandwidth. “The answer is: you need lots of everything,” as Huang put it.
His example, asking an LLM to construct a wedding seating plan with several conflicting constraints, was challenging for a relatively small model like those in wide use today, but a bigger model with added reasoning solved it easily. Easily here means: 20× more tokens requiring 150× more compute because of the increased model size. Optimizing for token latency as well as throughput will require different configurations and optimizations, he said.
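The keynote’s arithmetic can be sanity-checked. If 20× the tokens cost 150× the compute, each token of the bigger reasoning model is implicitly about 7.5× more expensive, consistent with per-token cost growing with model size. Only the 20× and 150× figures come from the keynote; the decomposition is our inference.

```python
# Back-of-the-envelope check of the keynote's wedding-planner figures.
# Keynote figures: 20x more tokens, 150x more total compute.
token_multiplier = 20       # reasoning model emits 20x more tokens
compute_multiplier = 150    # total compute claimed in the keynote

# Implied cost per token (our inference, assuming cost scales with
# model size): total compute ratio divided by token ratio.
per_token_cost_ratio = compute_multiplier / token_multiplier
print(per_token_cost_ratio)  # 7.5
```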
With this came the announcement of Nvidia Dynamo, a distributed inference serving library that, when used in combination with next-gen Blackwell hardware (versus Hopper), can roughly treble the tokens per second per user across a much wider range of throughput (i.e. there is less of a tradeoff between throughput and latency).
With Dynamo, Blackwell’s performance is up to 40× versus Hopper in terms of token revenue, Huang said.
“When Blackwell started shipping, you couldn’t give Hoppers away,” he joked. “There are some circumstances where Hopper is fine…but not many!” he added, referring to himself in jest as Nvidia’s “chief revenue destroyer.”
Ultimately, a 1-MW token factory that could produce 300 million tokens with Hopper can produce 12 trillion with Blackwell. “The more you buy, the more you save,” Huang joked. “But it’s even better than that—the more you buy, the more you make.”
“You want a programmable architecture, which is as fungible as possible,” he added as an aside, noting that workloads change, perhaps obliquely addressing another area where he, or the market, is perceiving competition right now.
Huang also gave some details on what is on Nvidia’s roadmap for the next couple of years. The second half of 2025 will see Blackwell Ultra become available, in the NVL72 rack-scale configuration. Each GPU will pair two reticle-sized dies offering 15 PF of dense FP4 compute with 288 GB of HBM3e. It will also add new instructions specifically for attention, the mechanism at the heart of transformer LLMs.
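For context on what those instructions would accelerate, the core attention computation is softmax(QKᵀ/√d_k)·V. Below is a minimal, pedagogical pure-Python version of scaled dot-product attention; it illustrates the math only and has nothing to do with Nvidia’s actual instruction set or implementation.

```python
import math

# Minimal scaled dot-product attention, written out in pure Python
# for clarity. Pedagogical sketch only; real kernels are fused,
# batched and run on tensor cores.

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs:
print(attention([[1.0, 0.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 2.0], [3.0, 4.0]]))
```

Every generated token repeats this computation over the full context, which is why attention dominates inference cost and is a natural target for dedicated instructions.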
After Blackwell will come the next generation of Nvidia CPUs and GPUs, Vera Rubin, named for the American astronomer whose observations provided key evidence for dark matter. The Vera CPU, with 88 custom Arm cores, will double the performance of Grace and will be available in the second half of 2026. The first Rubin GPUs will again package two reticle-sized GPU dies, but will offer 50 PF of FP4 compute with 288 GB of HBM4.
Rubin Ultra, coming in the second half of 2027, will have four reticle-sized GPUs in the same package for the first time. The slide drew gasps from the crowd: Rubin Ultra’s GPUs are arranged in a row, not a square, with HBM above and below. Rubin Ultra will have 100 PFLOPS of FP4 compute and 1 TB of HBM4e. Its performance is expected to be 900× what can be achieved with Hopper, at 3% of the TCO.
From now on, new Nvidia products and technologies will come once a year, “like clock ticks,” Huang said, referencing Intel’s famous “tick-tock” annual cadence of years past.
The other big roadmap reveal was the move to silicon photonics—specifically, co-packaged optics for chip-to-chip communication in future generations of large GPU systems. Huang spent several minutes emphasizing the power draw and cost of today’s optical transceivers and noted that scaling to millions of GPUs is not feasible with current technology.
His slide said Nvidia is working on this with “ecosystem partners” and that the technology is based on micro-ring modulators, but did not name any partners other than TSMC.
After Rubin, Nvidia’s next generation of GPU technology will be named after theoretical physicist Richard Feynman.
Cut to the Nvidia mothership, its UFO-esque Santa Clara HQ building, lifting from the ground and shooting off into space like the Starship Enterprise (the building is actually named “Voyager”). While tech startups frequently refer to themselves as a “rocketship,” it was a timely reminder of who the real master of rocketing growth is. It is Huang, and it always was.
From EETimes