FengHuang: Next-Generation Memory Orchestration for AI Inferencing

Jiamin Li, Lei Qu, Tao Zhang, Grigory Chirkov, Shuotao Xu, Peng Cheng, Lidong Zhou

TL;DR

The paper tackles memory-capacity, memory-bandwidth, and interconnect bottlenecks in AI inference by proposing FengHuang, a disaggregated shared-memory platform that decouples compute from memory through a Tensor Addressable Bridge (TAB) and a two-tier memory system. It introduces two hardware innovations: a tensor prefetcher to hide remote-memory latency, and a shared-memory design for inter-xPU communication that covers five core operations (AllReduce, ReduceScatter, AllGather, AllToAll, P2P). Theoretical analysis and simulation show substantial speedups over conventional NVLink-based scaling. Across workloads such as GPT-3, Grok-1, and Qwen3-235B, FengHuang achieves up to $93\%$ local-memory capacity reduction, $50\%$ GPU compute savings, and $16\times$ to $70\times$ faster inter-GPU communication, enabling up to $50\%$ fewer GPUs without sacrificing end-user performance. The framework emphasizes open, vendor-agnostic design principles and a scalable, rack-level, memory-centric approach that could reduce infrastructure and power costs while improving AI inference efficiency at warehouse scale.

Abstract

This document presents a vision for a novel AI infrastructure design that has been initially validated through inference simulations on state-of-the-art large language models. Advancements in deep learning and specialized hardware have driven the rapid growth of large language models (LLMs) and generative AI systems. However, traditional GPU-centric architectures face scalability challenges for inference workloads due to limitations in memory capacity, bandwidth, and interconnect scaling. To address these issues, the FengHuang Platform, a disaggregated AI infrastructure platform, is proposed to overcome memory and communication scaling limits for AI inference. FengHuang features a multi-tier shared-memory architecture combining high-speed local memory with centralized disaggregated remote memory, enhanced by active tensor paging and near-memory compute for tensor operations. Simulations demonstrate that FengHuang achieves up to 93% local memory capacity reduction, 50% GPU compute savings, and 16x to 70x faster inter-GPU communication compared to conventional GPU scaling. Across workloads such as GPT-3, Grok-1, and QWEN3-235B, FengHuang enables up to 50% GPU reductions while maintaining end-user performance, offering a scalable, flexible, and cost-effective solution for AI inference infrastructure. FengHuang provides an optimal balance as a rack-level AI infrastructure scale-up solution. Its open, heterogeneous design eliminates vendor lock-in and enhances supply chain flexibility, enabling significant infrastructure and power cost reductions.

Paper Structure

This paper contains 38 sections, 5 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Growing number of AI users worldwide and growing sizes of the state-of-the-art AI models. Data is taken from [brown2020gpt3, smith2021mtnlg, chowdhery2022palm, fedus2022switch, du2022glam, resourcera2025, altindex2025].
  • Figure 2: High-level design of a conventional inference node. Separate devices are bound by inter-device interconnect with relatively low bandwidth. In such systems, increasing memory capacity necessarily entails adding additional compute hardware. Sharing data between xPUs involves transferring data over a relatively slow interconnect.
  • Figure 3: High-level design of a FengHuang node. Most of the node's memory is connected to the Tensor Addressable Bridge (TAB) and is shared between all xPUs. Memory scaling can be done separately from compute scaling. Exchanging data between devices is quick and done through the shared memory.
  • Figure 4: Model Memory Capacity Requirements (Batch Size = 16)
  • Figure 5: MFU (Model FLOPs Utilization) vs. Batch Size
  • ...and 16 more figures