THE LATEST NEWS
Furiosa Targets LLM Inference With Second-Gen Chip

Stealthy South Korean startup Furiosa will present its second-generation data center AI inference chip RNGD (pronounced “Renegade”) at the Hot Chips conference later today.

Furiosa is still optimizing its software, but initial testing shows a single RNGD PCIe card (one accelerator chip) can deliver 2,000-3,000 tokens/s of throughput, depending on context length, for LLMs with 10 billion parameters. Its power envelope is 150-200 W.
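A back-of-envelope calculation puts that figure in context. The article does not state RNGD's memory bandwidth, so the numbers below are illustrative assumptions: ~1.5 TB/s for HBM3 and FP8 weights at one byte per parameter, with every weight read once per generated token.

```python
# Back-of-envelope: memory-bandwidth bound on single-stream decode throughput.
# Assumptions (NOT from the article): ~1.5 TB/s HBM3 bandwidth, FP8 weights
# (1 byte/parameter), and one full pass over the weights per generated token.
params = 10e9            # 10B-parameter model
bytes_per_param = 1      # FP8
bandwidth = 1.5e12       # bytes/s, assumed aggregate HBM3 bandwidth

single_stream_ceiling = bandwidth / (params * bytes_per_param)
print(f"single-stream ceiling: ~{single_stream_ceiling:.0f} tokens/s")

# Reaching 2,000-3,000 tokens/s therefore implies batched serving, where one
# pass over the weights produces a token for many concurrent requests.
implied_batch = 2500 / single_stream_ceiling
print(f"implied concurrency for 2,500 tokens/s: ~{implied_batch:.0f} requests")
```

Under those assumptions, the quoted throughput is an aggregate across batched requests rather than a single-stream number, which is the usual way such figures are reported.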

Furiosa CEO June Paik spoke exclusively to EE Times in advance of the unveiling about the company’s history and formation, the chip’s microarchitecture and his plans for the future.

Furiosa CEO June Paik
June Paik (Source: Furiosa)

“When we started the company, the first thing we had was a strong belief about AI, regardless of the semiconductor industry, we believed AI would be a big thing,” Paik told EE Times. “It’s not like I immediately thought of starting a chip company, because I knew how capital-intensive it is, how hard it is, and how difficult a business it is for a startup.”


Paik and two co-founders started Furiosa in 2017. Paik had been working in Samsung’s memory division in Korea; before that, he worked on GPU software at AMD, having studied computer architecture at Georgia Tech. Around the time he left Samsung, he said, the company was trying to add value to its commodity memory products by adding logic—which naturally led to some accelerator concepts.

“This laid a good foundation for me and our team members to think about a more application-specific accelerator design,” he said.

Paik left Samsung in 2016. By chance, ISCA was held in Seoul that year. Paik went along and listened to many presentations on AI chip research, and spoke to many former colleagues from Samsung, including Furiosa co-founder and CTO Hanjoon Kim.

“It just clicked—AI chips are just getting started,” he said. “There was some research, but there were still many things to explore. It just clicked that we could give it a shot.”

The name “Furiosa,” based on Charlize Theron’s character from the 2015 film “Mad Max: Fury Road,” was always intended as a placeholder.

“It was supposed to be a temporary name, to be honest,” Paik said. “We didn’t consider it a serious name. But I feel a startup needs to be aggressive and bold, and you need to survive through everything happening to you. We didn’t have much capital, unlike the big players, and we had to survive through all these things, so the Furiosa name resonated with the way we think about our business.”

Furiosa’s first investor, Korean search engine company Naver, invested $1 million. Furiosa’s first-gen chip, launched in 2021, targeted CNN models, though the team was conscious of not over-specializing, Paik added. With limited funding (around $6 million by then), the team built their design on Samsung 14 nm to prove the efficiency of their tensor contraction processor concept. Samples arrived in May 2021, and the company submitted MLPerf results three weeks later. This chip is now deployed in small volumes in data centers belonging to South Korean internet company Kakao.

Furiosa RNGD packaging
Furiosa’s design for RNGD was “as aggressive as possible.” (Source: Furiosa)

For RNGD, the company was able to raise $60 million using its first generation as a proof point.

“With that budget, we tried to push our chip as aggressively as possible,” Paik said.

GPT-3 had appeared in research papers in 2020, so the team was aware of the scale transformer models could reach. RNGD’s development kicked off in 2022, and it reflects the biggest die size (640 mm²), the most advanced node (TSMC 5 nm) and the most advanced memory (48 GB HBM3) the company could afford. Floating-point support has been added for LLM inference, and on-chip memory has increased to 256 MB. The chip draws 150-200 W, versus the first generation’s 60 W.

RNGD offers 512 TFLOPS FP8 performance and supports BF16, FP8, INT8 and INT4 formats. RNGD comes on a PCIe card suitable for around 8-10 billion parameter models, or eight cards can be connected for inference of models up to around 100 billion.
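A quick capacity check makes the eight-card figure plausible. The byte-per-parameter assumption below is not from the article; it assumes 16-bit (BF16) weights, with the remaining memory available for KV cache and activations.

```python
# Rough capacity check for the "eight cards for ~100B-parameter models" claim.
# Assumption (NOT from the article): 16-bit weights at 2 bytes/parameter.
cards = 8
hbm_per_card_gb = 48
params_billions = 100
bytes_per_param = 2                       # BF16

total_gb = cards * hbm_per_card_gb        # aggregate HBM across cards
weights_gb = params_billions * bytes_per_param
print(f"{total_gb} GB total, {weights_gb} GB weights, "
      f"{total_gb - weights_gb} GB left for KV cache and activations")
```

At FP8 (one byte per parameter) the weight footprint would halve, leaving even more headroom for long-context KV caches.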

Furiosa RNGD chip on PCIe card
Furiosa’s second-generation chip, RNGD, uses HBM3. Its previous generation used LPDDR memory. (Source: Furiosa)

Tensor contraction

Furiosa’s tensor contraction processor concept is based on a non-MatMul architecture that Paik said offers a better balance of performance, efficiency and programmability.

“One architectural thing we focused on is how we can make the right abstraction between hardware and software to achieve good efficiency, and cost efficiency,” he said. “At the same time, the chip must be able to be programmable enough so the compiler engineer can map AI models to our hardware and fully utilize all its compute as quickly as possible.”

Furiosa's tensor contraction processor concept
Furiosa’s tensor contraction processor concept showing one processing element. (Source: Furiosa)

Many types of AI accelerators today rely on 2D matrix multiplication as a primitive. AI data usually comes in the form of tensors—multi-dimensional arrays—that are split into 2D matrices for processing. Furiosa’s concept keeps the tensor as the primitive.

“We raised that abstraction [level], so the way the chip operates more naturally reflects multi-dimensional matrix multiplication, called tensor contraction,” Paik said. “It’s easier for us to optimize compute for neural networks because this is a more natural abstraction, and we also think we can achieve higher efficiency of data reuse more easily compared to other chips using 2D matrix multiplication.”

Having the tensor as the primitive on which operations are carried out means relationships between data across multiple dimensions of the tensor can be preserved. Computational requirements are reduced by combining elements across multiple dimensions in a single operation. In Furiosa’s example, an LLM input might be a tensor with dimensions for batch size, sequence length and features; slicing it into 2D matrices might mean losing the distinction between different sequences. In hardware, tensors are fetched from DRAM to SRAM only once, and 1D slices can be multicast from SRAM to enable data reuse. Layer activations are kept in SRAM ready for the next layer’s weights, without additional DRAM accesses.
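The mathematical idea can be sketched with NumPy's einsum, which expresses a tensor contraction directly. This is only an illustration of the concept; the shapes are arbitrary and nothing here reflects Furiosa's actual hardware primitive or tile sizes.

```python
import numpy as np

# Tensor contraction generalizes 2D matmul: the shared dimension is contracted
# in one operation while batch and sequence dimensions stay intact.
batch, seq, features, hidden = 4, 128, 512, 1024
x = np.random.rand(batch, seq, features)   # activations: (batch, seq, features)
w = np.random.rand(features, hidden)       # layer weights: (features, hidden)

# One contraction over the shared 'f' dimension, preserving batch/sequence:
y = np.einsum("bsf,fh->bsh", x, w)

# Equivalent 2D-matmul formulation: slice the tensor into `batch` matrices.
y_sliced = np.stack([x[b] @ w for b in range(batch)])
assert np.allclose(y, y_sliced)
print(y.shape)  # (4, 128, 1024)
```

The two formulations compute the same result, but the contraction form keeps the batch and sequence structure visible to the scheduler, which is the property the article describes as enabling better data reuse.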

Furiosa RNGD block diagram and die layout
RNGD block diagram and die layout showing eight processing elements. (Source: Furiosa)

Software stack

While Furiosa’s earlier software stacks supported only ONNX, the latest version supports PyTorch.

“This is a huge difference in terms of user experience,” Paik said. “To support PyTorch, our software stack needs to be more general-purpose and it must be fundamentally more mature.”

PyTorch code is converted into an intermediate format (an FP16/FP32 graph) and quantized into lower-precision formats. The graph is split into subgraphs, which are compiled separately for optimal execution and then mapped to the hardware. Lower-level interfaces are available for advanced customers.
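The quantization step can be illustrated with a standard symmetric per-tensor INT8 scheme. This is a generic textbook scheme chosen for illustration, not Furiosa's actual quantizer, and the compiler would apply something like it per layer of the captured graph.

```python
import numpy as np

# Generic symmetric per-tensor INT8 quantization, sketching the kind of
# FP32 -> lower-precision step a compiler performs. NOT Furiosa's quantizer.
def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0   # map max magnitude onto int8 range
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print("max abs error:", np.abs(w - w_hat).max())  # bounded by scale / 2
```

Per-channel scales and calibration over sample inputs are common refinements of this basic scheme; the article gives no detail on which variants Furiosa's stack uses.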

“From the beginning, we saw the importance of architectural foundations and the software stack,” he said. “AI chips are domain-specific, so sometimes there’s a preconception that AI chips are built for very specialized functions; the software stack sometimes was underestimated [in the past], in my opinion. You need delicate consideration of how you can build your software stack on top of your chip.”

Furiosa’s software team is 70% of its 120-person engineering staff today, while its hardware team makes up the other 30% (the company partnered with GUC on SoC design for RNGD).

“Every company aspires to hardware-software co-design, but it’s easier said than done,” Paik said. “Communication between hardware and software teams is sometimes not easy because they have different backgrounds and views, but we are really trying to build a team that can have a unified view on these things.”

Furiosa RNGD during bringup
Furiosa’s RNGD during bringup in the startup’s lab. (Source: Furiosa)

Next generations

Looking forward, Furiosa’s team has an eye on the development of its third-generation architecture.

“There are two very important directions on how models are evolving,” he said. “One is the scale of models. Right now people are talking about trillion-parameter-scale models. We are always seriously thinking about how quickly we can scale up our chip performance. RNGD is an order of magnitude improvement over our first-gen chip, but for our third gen we will also need to scale up an order of magnitude.”

An order of magnitude in compute density could come from technologies like chiplets and HBM4, Paik suggested.

On the microarchitecture side, Paik thinks that while the industry has come to a consensus on transformers, programmability and flexibility will still be required to accommodate changing models.

“You can’t predict the future, but you can build the architecture flexible enough to accommodate those changes, though it makes design way more challenging,” he said.

Samples of RNGD will be available early in 2025.

From EE Times
