Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates

Yang Liu, Yi Chen, Yongwei Zhao, Yifan Hao, Zifu Zheng, Weihao Kong, Zhangmai Li, Dongchen Jiang, Ruiyang Xia, Zhihong Ma, Zisheng Liu, Zhaoyong Wan, Yunqi Lu, Ximing Liu, Hongrui Guo, Zhihao Yang, Zhe Wang, Tianrui Ma, Mo Zou, Rui Zhang, Ling Li, Xing Hu, Zidong Du, Zhiwei Xu, Qi Guo, Tianshi Chen, Yunji Chen

TL;DR

This paper addresses the unsustainable energy footprint of LLM inference by proposing an extreme specialization: a Hardwired-Neurons Language Processing Unit (HNLPU) that embeds LLM weights directly into hardware. Central to this is Metal-Embedding, which encodes weights in a 3D metal-wire topology, enabling a 15x density increase and shared photomasks across chips, dramatically reducing NRE costs. The first design, HNLPU, implements gpt-oss 120 B at 5 nm across 16 chips, achieving 249,960 tokens/s and 36 tokens/J, with a total area of 13,232 mm² and a 112x reduction in mask costs compared to naive hardwiring, leading to 41.7–80.4x lower TCO and 357x lower carbon footprint against OpenAI-scale H100 clusters. These results imply that an economically viable, ultra-efficient cognitive substrate for general tasks is achievable, potentially transforming cloud inference economics and sustainability.

Abstract

The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric, achieving several orders of magnitude computational efficiency improvement by extreme specialization. However, a significant challenge still lies in the scale of modern LLMs. A straightforward hardwiring of gpt-oss 120 B would require fabricating photomask sets valued at over 6 billion dollars, rendering this straightforward solution economically impractical. Addressing this challenge, we propose the novel Metal-Embedding methodology. Instead of embedding weights in a 2D grid of silicon device cells, Metal-Embedding embeds weight parameters into the 3D topology of metal wires. This brings two benefits: (1) a 15x increase in density, and (2) 60 out of 70 photomask layers are homogeneous across chips, including all EUV photomasks. In total, Metal-Embedding reduced the photomask cost by 112x, bringing the Non-Recurring Engineering (NRE) cost of HNLPU into an economically viable range. Experimental results show that HNLPU achieved 249,960 tokens/s (5,555x/85x that of GPU/WSE), 36 tokens/J (1,047x/283x that of GPU/WSE), 13,232 mm² total die area, and an estimated NRE of $59.46M to $123.5M at 5 nm technology. Analysis shows that HNLPU achieved 41.7-80.4x improvement in cost-effectiveness and 357x reduction in carbon footprint compared to OpenAI-scale H100 clusters, under an annual weight updating assumption.
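
The headline figures in the abstract imply a few derived quantities that are easy to check. The following Python sketch is only a back-of-envelope calculation, not something reported by the authors: it assumes the throughput and efficiency numbers describe the same 16-chip system, that the quoted ratios are rounded, and all variable names are illustrative.

```python
# Figures quoted in the TL;DR and abstract.
HNLPU_TOKENS_PER_S = 249_960
HNLPU_TOKENS_PER_J = 36
TOTAL_DIE_AREA_MM2 = 13_232
NUM_CHIPS = 16

# Quoted speedup / efficiency ratios versus the GPU and WSE baselines.
SPEEDUP = {"GPU": 5_555, "WSE": 85}
EFFICIENCY_GAIN = {"GPU": 1_047, "WSE": 283}

implied = {
    # (tokens/s) / (tokens/J) = J/s = W, assuming both describe the same system.
    "system_power_w": HNLPU_TOKENS_PER_S / HNLPU_TOKENS_PER_J,
    # Mean area per chip if the total is split evenly (an assumption).
    "mean_die_area_mm2": TOTAL_DIE_AREA_MM2 / NUM_CHIPS,
    # Baseline figures implied by dividing out the quoted ratios.
    "baseline_tokens_per_s": {k: HNLPU_TOKENS_PER_S / v for k, v in SPEEDUP.items()},
    "baseline_tokens_per_j": {k: HNLPU_TOKENS_PER_J / v for k, v in EFFICIENCY_GAIN.items()},
}

if __name__ == "__main__":
    for name, value in implied.items():
        print(name, value)
```

Under these assumptions the implied system power is roughly 6.9 kW and the mean die area is 827 mm² per chip. Because the quoted ratios are rounded, the implied baseline numbers are approximate.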

Paper Structure

This paper contains 41 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Hardwired LPU as a general-purpose processor. Left: AI infrastructure originated in the rapidly evolving field of deep learning, which valued universality over extreme efficiency. Right: As LLMs develop, the responsibility for universality is shifting from hardware/software to the LLMs themselves, so an extremely specialized Hardwired LPU can also be useful for general tasks.
  • Figure 2: Economic challenges of hardwiring. Of the photomask and wafer costs, the photomask cost of GPUs is amortized over mass production. Hardwiring an LLM requires too many photomasks at too low a volume to amortize the NRE costs.
  • Figure 3: Key arithmetic techniques. Middle: Combining repeated multipliers via the distributive law. Right: Using Carry Save Adders (CSA) on bit-serialized inputs to trade time for area.
  • Figure 4: Hardwired-Neuron architecture. ❶ A conventional cell-embedding neuron contains 2,880 4b multipliers (16 shown) followed by an 8b×2,880 adder tree, where 2,880 is the hidden size in gpt-oss 120 B; ❷ With Metal-Embedding (ME), Hardwired-Neurons accept 1b serialized inputs (LSB-first), (1) route the inputs that multiply the same weight value to the same region, (2) perform accumulation (POPCNT) on these input signals, (3) perform the actual multiplication with 16 multipliers (4 shown), (4) sum the results with a 4b×16 adder tree. Note that ❷ is significantly smaller in area than ❶ because it reduces the number of multipliers and the strength of the adders.
  • Figure 5: Step-by-step schematic showing how weights are physically embedded in the 3D metal wire topology. HNs are accumulate-multiply-accumulate arithmetic units where each weight parameter is expressed by the source and destination of a metal wire: ❶ $ax_1$ by connecting from $x_1$ to the blue region; ❷ $ax_2$ by connecting from $x_2$ to the blue region; ❸ $cx_3$ by connecting from $x_3$ to the red region; ❹ $cx_4$ by connecting from $x_4$ to the red region.
  • ...and 9 more figures
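
The arithmetic described in the captions of Figures 3 to 5 can be illustrated in software. The Python sketch below is a minimal model under stated assumptions, not the authors' hardware design: the function names, the unsigned 8-bit input width, and the example values are hypothetical. It compares a conventional multiply-then-add neuron with the accumulate-multiply-accumulate form, in which inputs sharing a weight value are summed first (the distributive law), and with a bit-serial, LSB-first variant in which each cycle only counts set bits per weight region (POPCNT).

```python
from collections import defaultdict


def naive_neuron(weights, inputs):
    """Conventional form: one multiplication per (weight, input) pair."""
    return sum(w * x for w, x in zip(weights, inputs))


def grouped_neuron(weights, inputs):
    """Accumulate-multiply-accumulate: a*x1 + a*x2 == a*(x1 + x2)."""
    region_sums = defaultdict(int)
    for w, x in zip(weights, inputs):
        region_sums[w] += x  # accumulate inputs routed to the same weight region
    # One multiplication per distinct weight value, then a final sum.
    return sum(w * s for w, s in region_sums.items())


def bit_serial_neuron(weights, inputs, input_bits=8):
    """Bit-serial variant, LSB first. Assumes unsigned inputs of input_bits bits."""
    total = 0
    for bit in range(input_bits):  # one input bit per cycle
        popcounts = defaultdict(int)
        for w, x in zip(weights, inputs):
            popcounts[w] += (x >> bit) & 1  # count set bits per weight region
        partial = sum(w * c for w, c in popcounts.items())
        total += partial << bit  # weight the partial sum by the bit position
    return total


# Figure 5 style example: x1, x2 share weight a = 3; x3, x4 share weight c = -2.
weights = [3, 3, -2, -2]
inputs = [5, 7, 1, 4]
assert naive_neuron(weights, inputs) == grouped_neuron(weights, inputs)
assert naive_neuron(weights, inputs) == bit_serial_neuron(weights, inputs)
```

With 4-bit weights there are at most 16 distinct weight values, so the grouped form needs at most 16 multiplications regardless of how many inputs the neuron has, which matches the 16 multipliers mentioned in the Figure 4 caption.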