Arm recently upgraded its Neoverse Compute Subsystem (CSS) designs with new CPU cores aimed at companies building their own custom chips for the data center.
The market for custom chips in the data center is significant, according to Mohamed Awad, senior VP and general manager for Arm’s infrastructure line of business.
“[Hyperscalers] are redesigning systems from the ground up, starting with custom specs,” he said. “This works because they know their workloads better than anyone else, which means they can fine-tune every aspect of the system, including the networking acceleration, and even general-purpose compute, specifically, to optimize for efficiency, performance and ultimately TCO.”
Hyperscale data center operators have developed multiple generations of their own custom Arm-based CPUs, including AWS Graviton, and Arm has been adopted for data center CPUs by companies including Ampere and Nvidia.
CSSes are Arm’s oven-ready designs that combine key SoC elements to give customers a head start when designing custom SoCs. Arm also has an ecosystem of design partners to help with implementation, if required. The overall aim is to make the path to custom silicon faster and more accessible. The recently announced Microsoft Cobalt 100 is based on second-gen CSS (specifically, CSS-N2).
Awad said hyperscalers choose Arm because the availability of CSSes means custom solutions can be created quickly and combined with Arm’s robust ecosystem.
“What we’re hearing from everywhere is that, generally speaking for hyperscalers and many of these OEMs, general-purpose compute is just not keeping up, meaning an off-the-shelf SoC is not keeping up,” he said. “We’re really optimistic about [CSSes] and we’ve seen tremendous traction with these platforms.”
The driver for hyperscalers’ desire to build their own chips is undoubtedly AI.
Awad said Arm has customers running AI inference at scale on Arm-based CPUs, partly because of the cost of custom accelerators and partly because of their limited availability. (The market-leading data center GPU, Nvidia’s H100, is in notoriously short supply.) CPUs are widely available and very affordable compared with other options, he said.
“The decision to offload an inference job to an accelerator, whether that’s a GPU or something else, comes down to the granularity of compute you’re dealing with,” he said. “At certain granularities of compute, within the context of the workload, it makes a lot more sense to keep it on the CPU…from a performance perspective, which ultimately will translate to cost.”
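Awad’s granularity argument can be made concrete with a simple break-even model. The sketch below is illustrative only: the throughput, transfer and launch-overhead figures are hypothetical placeholders, not Arm’s methodology or measured numbers. The point it captures is that an accelerator’s fixed costs (data movement, kernel launch) dominate small jobs, so fine-grained work stays on the CPU.

```python
# Illustrative break-even model (hypothetical numbers, not Arm's methodology):
# offloading pays off only when the compute time saved outweighs the fixed
# overhead of moving data to the device and launching work on it.

def best_target(flops, cpu_gflops, accel_gflops, transfer_bytes,
                pcie_gbps, launch_overhead_s=50e-6):
    """Return 'cpu' or 'accelerator' for a single inference job."""
    cpu_time = flops / (cpu_gflops * 1e9)
    accel_time = (flops / (accel_gflops * 1e9)          # device compute
                  + transfer_bytes / (pcie_gbps * 1e9 / 8)  # PCIe transfer
                  + launch_overhead_s)                   # fixed launch cost
    return "cpu" if cpu_time <= accel_time else "accelerator"

# Small, fine-grained job: fixed overheads dominate, so the CPU wins.
print(best_target(flops=5e6, cpu_gflops=100, accel_gflops=2000,
                  transfer_bytes=2e6, pcie_gbps=64))    # -> cpu
# Large, coarse-grained job: raw throughput dominates, accelerator wins.
print(best_target(flops=5e11, cpu_gflops=100, accel_gflops=2000,
                  transfer_bytes=2e8, pcie_gbps=64))    # -> accelerator
```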
Arm’s vision is for a “significant percentage, if not the vast majority” of AI inference to run on CPUs eventually, particularly as models become more optimized for CPU hardware.
“There’s a lot of work going on in that sphere—the market is evolving, so it’s like, what can I throw at it to get the job done as quickly as possible,” he said. “So that’s where we’re probably seeing some GPUs or AI accelerators being used [today]. In the future, a lot of that may end up on CPUs as the market matures.”
Arm also expects its CSS designs to be used in tightly coupled CPU-plus-accelerator designs, analogous to Nvidia’s Grace Hopper, which is optimized for memory capacity and bandwidth, Awad said.
CSSes don’t just work for hyperscalers; they can also support smaller companies, particularly through the Arm Total Design ecosystem of design partners, he said.
“[Smaller] companies are looking to get to market as quickly as possible to launch their solutions to capture market share, to establish themselves,” he said. “They’re also looking for a level of flexibility so that they can focus their innovation, and then they obviously need the performance to run some of these workloads.”
Collaborative relationship
With CSS, Arm takes responsibility for configuring, optimizing and validating a compute subsystem so the hyperscaler can focus on the system-level, workload-specific differentiation it cares about, whether that’s software tuning, custom acceleration or something else, said Dermot O’Driscoll, vice president of product solutions for Arm’s infrastructure line of business.
“They get faster time to market, they reduce the cost of engineering, and yet they take advantage of the same leading edge processor technology,” he said. “We created the CSS program to give customers the same kind of control of the silicon stack as they have over their software and system stacks today. This is a close collaborative relationship and our partners push us really hard to raise our game.”
Hyperscalers are highly focused on optimizing every layer of their infrastructure to get the best performance, especially performance per Watt, on diverse workloads, he said.
“This drives the need to understand and tune for each use,” he said. “The old cycle of software and hardware being developed in separate companies no longer keeps up with customer performance needs, or the complexity of either the software or the hardware. Customers want to see the hardware they deploy, even down to the microarchitecture, optimized to run their software workloads. This type of co-optimization is hard to do and requires significant commitment on both sides to make it work.”
Arm allows its customers to run workloads on simulations of its IP as it’s being developed, with customer feedback directly influencing how Arm evolves its architecture, O’Driscoll said.
Third-generation cores
The new Arm Neoverse CPU cores are the third generation of the N series (optimized for performance per Watt) and the V series (optimized for performance). CSS designs are available for the new N3 and V3 cores.
The CSS-N3 offers a 20% performance-per-Watt improvement per core over the CSS-N2. This CSS design comes with between 8 and 32 cores, with the 32-core version using as little as 40 W. It’s intended for telecoms, networking, DPU, and cloud applications and can be used with on-chip or separate AI accelerators. The new N3 core is based on Armv9.2 and includes 2 MB private L2 cache per core. It supports the latest versions of PCIe, CXL and UCIe.
The performance-tuned version, the V3 core, is Arm’s highest performance Neoverse core to date. CSS-V3 offers more than 50% better performance per socket compared to CSS-N2 (because this is the first CSS for V-series cores, a performance comparison to earlier CSS-V designs isn’t available). CSS-V3 can scale to 128 cores per socket for cloud, HPC and AI workloads. It supports DDR5/LPDDR5 and HBM3 memories with PCIe Gen5 and CXL 3.0 support.
While N3 has been further optimized for tasks like compression, which bring down cloud operators’ costs, V3’s optimizations include better performance for workloads like protocol buffers. Both show big improvements for AI data analytics; Arm’s figures are for XGBoost, a widely used machine learning (ML) library for regression, classification and ranking applications. Improvements to branch prediction, better management of last-level cache and associated memory bandwidth, and a bigger L2 cache almost doubled XGBoost performance on N3 cores versus N2.
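For readers unfamiliar with the workload, the sketch below shows generic XGBoost usage in Python; it is not Arm’s benchmark harness, and the dataset, model parameters and core count are placeholders. The relevant point is the inference step: gradient-boosted-tree traversal is branchy and cache-bound, which is exactly the code path that benefits from N3’s branch-prediction and cache improvements.

```python
# Minimal XGBoost sketch (generic usage, not Arm's benchmark setup):
# CPU-only gradient-boosted-tree training and inference.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

# n_jobs (hypothetical value) pins the thread count, the variable that
# scales across an 8- to 32-core CSS-N3 part.
model = xgb.XGBClassifier(n_estimators=200, max_depth=6,
                          n_jobs=16, tree_method="hist")
model.fit(X, y)

# Inference: per-sample traversal of hundreds of trees, dominated by
# data-dependent branches and cache behavior rather than raw FLOPS.
preds = model.predict(np.random.rand(1_000, 50))
```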
Arm has also been looking at generative AI, ready for the shift to inference at scale that O’Driscoll says is coming. Arm showed preliminary results for Llama2-7B running on Neoverse V1 and V2 performance-optimized cores (no figures are available yet for the third-gen V3 core announced today).
Part of the equation for cost-efficient inference is throughput, O’Driscoll said, adding that token generation throughput on deployed Arm silicon is already “very good.”
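Why throughput translates so directly into cost can be shown with back-of-envelope arithmetic; the throughput and instance price below are hypothetical, not Arm’s figures.

```python
# Back-of-envelope serving-cost model (all numbers hypothetical):
# cost per token falls straight out of sustained throughput.
tokens_per_sec = 25.0           # assumed aggregate generation rate per instance
instance_cost_per_hour = 1.50   # assumed on-demand price, USD

tokens_per_hour = tokens_per_sec * 3600
cost_per_million_tokens = instance_cost_per_hour / tokens_per_hour * 1e6
print(f"${cost_per_million_tokens:.2f} per million tokens")  # -> $16.67
```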
“CPUs are widely available and can flexibly be used for ML or other workloads,” he said. “They are easy to deploy, support a variety of software frameworks, and are cost and energy efficient. So we know CPU inference will be a key part of the genAI computing footprint, and we can see these workloads already benefiting from ML-specific Neoverse features like BFloat16, MatMul, SVE [Scalable Vector Extension] and SVE2 as well as our microarchitectural optimizations, and that trend will continue.”
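On a Linux/arm64 machine, the features O’Driscoll lists are visible in the kernel’s reported hardware capabilities. The snippet below is one simple way to check for them; the flag names match the kernel’s hwcap strings, and the example output is what a recent Neoverse V2-class part might report.

```python
# Quick check (Linux/arm64 only) for the ML-relevant CPU features the
# quote mentions, as exposed by the kernel in /proc/cpuinfo.
def arm_ml_features(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("Features"):
                flags = set(line.split(":", 1)[1].split())
                return {name: name in flags
                        for name in ("bf16", "i8mm", "sve", "sve2")}
    return {}

print(arm_ml_features())
# e.g. {'bf16': True, 'i8mm': True, 'sve': True, 'sve2': True}
# on a recent V2-based design.
```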
From EE Times