Well, that was not quite the effect Silicon Valley, and President Donald Trump, were expecting when they announced the $500-billion Stargate AI infrastructure project last week.
In a move whose timing seems unlikely to be coincidental, Chinese AI lab DeepSeek released DeepSeek-R1, a reasoning LLM that matches the performance of OpenAI’s latest o1 model, on Trump’s inauguration day, Jan. 20. DeepSeek-R1 is a fine-tuned version of DeepSeek-V3, which was pre-trained using about $5.5 million worth of compute, roughly 10-20× less than other comparably sized LLMs.
The market’s reaction was to wipe about $500 billion off Nvidia’s market cap, with the stock dropping 17% in less than a day.
But has the AI bubble burst? Is this unprecedented fall in Nvidia stock a trend or a correction?
DeepSeek, a spin-out from AI-driven Chinese hedge fund High-Flyer AI, trained V3 on an extremely modest cluster of 2,048 Nvidia H800 GPUs. The H800 is a cut-down version of the market-leading H100, built for the Chinese market and designed to skirt the U.S. export regulations in force at the time. H800s are compute-capped and have reduced chip-to-chip communication bandwidth, which is vital for training LLMs.
For V3, a 671B MoE model that activates about 37B parameters on each forward pass, DeepSeek’s paper says they used 2.788 million GPU-hours to pre-train on 14.8 trillion tokens. On their cluster of 2,048 GPUs, that works out to roughly 57 days, and at $2 per GPU-hour, the cost comes to about $5.5 million. This figure covers the pre-training stage of V3 only.
For comparison, Llama 3.1-405B was trained using 30.8 million GPU-hours on 16,000 H100s, with slightly more data. The compute difference still works out to roughly a factor of 10.
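The figures above are easy to sanity-check. The back-of-envelope arithmetic below uses only numbers quoted in this article; the $2/GPU-hour rate is DeepSeek’s own assumption, not a market price:

```python
# Back-of-envelope check of the training-cost figures quoted above.
V3_GPU_HOURS = 2_788_000      # DeepSeek-V3 pre-training compute (per the paper)
CLUSTER_GPUS = 2_048          # H800 GPUs in DeepSeek's cluster
RATE_PER_GPU_HOUR = 2.00      # dollars, DeepSeek's assumed rental rate

days = V3_GPU_HOURS / CLUSTER_GPUS / 24          # wall-clock training time
cost = V3_GPU_HOURS * RATE_PER_GPU_HOUR          # projected dollar cost

LLAMA_GPU_HOURS = 30_800_000  # Llama 3.1-405B pre-training compute
ratio = LLAMA_GPU_HOURS / V3_GPU_HOURS

print(f"wall-clock: {days:.0f} days")            # ~57 days
print(f"cost: ${cost / 1e6:.2f} million")        # ~$5.58 million
print(f"Llama 3.1-405B used {ratio:.0f}x the GPU-hours")
```

The computed cost lands at $5.58 million, consistent with the roughly $5.5 million figure cited, and the Llama comparison comes out at about 11×, in line with the “factor of 10” above.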
How did DeepSeek train so efficiently? According to their paper, they have several tricks up their sleeves. One of the biggest seems to be DualPipe, a pipeline-parallelism algorithm of their own design that overlaps computation and communication so that most of the communication overhead is hidden.
“As the model scales up […] we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead,” according to the paper. The company also developed all-to-all communication kernels (at the PTX level, a level below CUDA code) that better utilize InfiniBand and NVLink bandwidth, and trimmed the memory footprint enough to avoid using tensor parallelism at all. The net result is to hide communication bottlenecks. The company also dropped precision to FP8 in as many places as possible, among other techniques.
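Why does overlapping compute and communication matter so much? The toy timing model below (not DeepSeek’s code; the millisecond figures are invented for illustration) shows the basic effect: without overlap, each step pays for compute plus communication; with perfect overlap, the step is bounded only by the longer of the two:

```python
# Toy timing model illustrating how overlapping compute with all-to-all
# communication hides the communication cost. Numbers are hypothetical.
compute_ms = 10.0   # time for one micro-batch's expert computation
comm_ms = 8.0       # time for that micro-batch's all-to-all dispatch/combine

# Naive schedule: communication waits for compute to finish.
sequential = compute_ms + comm_ms

# Overlapped schedule (DualPipe's stated goal): communication for one
# micro-batch runs during the compute of another, so per-step time is
# bounded by whichever phase is longer.
overlapped = max(compute_ms, comm_ms)

hidden = sequential - overlapped
print(f"per-step time: {sequential} ms -> {overlapped} ms "
      f"({hidden} ms of communication hidden)")
```

In this sketch the entire 8 ms of communication disappears behind compute, which is what “near-zero all-to-all communication overhead” means in practice: communication still happens, but it no longer adds to the step time.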
To get from V3 to R1, DeepSeek reportedly used additional reinforcement learning and supervised fine-tuning stages to improve the model’s reasoning capabilities.
It should be noted that the $5.5 million figure is projected (not a real dollar spend) for one training run, one time. Development of the V3 model surely took months or years of expensive research and development, including failed training runs that are not counted here. The cost of compute used for further reinforcement learning and supervised fine-tuning (plus the cost of developing or obtaining synthetic data for this, see below) to make V3 into R1 is also not counted, and was probably considerable.
DeepSeek or its parent company presumably had to invest in many thousands of whatever GPUs it could get its hands on. Huge investment has been required to get to this stage, even if it is not on the same scale as OpenAI or the new U.S. infrastructure push.
DeepSeek-R1 did not develop in a vacuum. It would not have been built without a cutting-edge reasoning model as a target, with OpenAI’s o1 the most well-known example. It could be argued that R1 is less innovation, more replication, building on frontiers already broken by the big U.S. AI labs at their expense. The techniques used to get there may be innovative, but the result is still analogous to o1.
There is also reportedly a more literal link. The FT reports that OpenAI has evidence its o1 model was used to create synthetic training data for R1. In other words, OpenAI is alleging DeepSeek used its API to effectively copy its model. This is a well-known technique called distillation, in which smaller LLMs are made more efficient by training them to copy the outputs of bigger LLMs.
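A minimal sketch of the distillation idea, with made-up toy numbers: the student model is trained to match the teacher’s output distribution, classically by minimizing the KL divergence between their temperature-softened outputs.

```python
# Toy illustration of distillation (hypothetical numbers, stdlib only):
# the student minimizes the gap between its output distribution and the
# teacher's, here measured by KL divergence over softened logits.
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q): how far the student's distribution q is from the teacher's p
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [2.0, 1.0, 0.1]   # e.g. scores from the larger model
student_logits = [1.5, 1.2, 0.3]

T = 2.0  # higher temperature exposes more of the teacher's soft preferences
loss = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
print(f"distillation loss: {loss:.4f}")  # the student trains to drive this down
```

In the scenario OpenAI alleges, an API exposes sampled text rather than raw logits, so the “distillation” would amount to supervised fine-tuning on teacher-generated outputs; the underlying idea of learning to imitate a stronger model is the same.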
This is a great “cheat,” but when used by another company it is akin to riding OpenAI’s coat-tails or, less charitably, stealing its expensively developed IP. While OpenAI’s API is not currently restricted by U.S. export regulations, using its outputs in this way is expressly prohibited by OpenAI’s terms of service.
So, is the result that people will buy more Nvidia chips or fewer?
One could argue that Nvidia could suffer in the short term: if the performance gains from this work mean companies hang onto their Hopper-generation GPUs for longer, purchase orders for expensive Blackwells could be pushed out. (One of the Blackwell-based NVL72’s key innovations is removing the bottlenecks caused by all-to-all communication between GPUs, something DeepSeek’s work tackles directly on Hoppers.)
Nvidia could also suffer from newly introduced U.S. export restrictions that substantially limit its overseas markets, unrelated to DeepSeek and these new developments.
In the longer term, though, it seems likely that as AI grows, it will need more chips.
Suddenly, X (Twitter) and LinkedIn are filled with armchair experts on Jevons’ paradox: the phenomenon whereby, as a technology gets cheaper or more efficient, more of it is sold, not less. In this case, the argument goes: as AI gets cheaper to train, more AIs will be trained, and the resulting growth and proliferation of AI technology will drive the market for AI chips. Semiconductor companies do not stop making their chips faster and more efficient, generation on generation, for fear that demand for chips will go down; the applications, workloads and markets expand to fill the space, and beyond.
Let us not forget that DeepSeek’s specific training techniques apply only to Nvidia GPU clusters, effectively proving you still need Nvidia chips to train this efficiently. No doubt the big U.S. AI labs are working hard to replicate and implement similar techniques as we speak. They will need Nvidia chips to do so.
While Nvidia faces challenges from competitors and the U.S. government, it still has its software moat and it still has a huge installed base. It is still selling more chips than it can make. Anyone working out ways to use GPUs more efficiently will ultimately help the propagation of AI, which will help Nvidia sell more chips.
From EETimes