The Immutable Tensor Architecture: A Pure Dataflow Approach for Secure, Energy-Efficient AI Inference

Fang Li

TL;DR

The paper introduces The Immutable Tensor Architecture (ITA), a memory-hierarchy-free dataflow ASIC design that encodes neural network weights directly into hardware. ITA treats weights as fixed circuit topology and uses a Split-Brain protocol: the host keeps the dynamic KV-cache while the device processes the fixed weights, which yields large energy and area savings. Key innovations include logic-embedded weights via Canonical Signed Digit encoding, shift-add trees, and zero-weight pruning, giving an estimated 50× improvement in energy efficiency and up to 4.85× fewer MAC gates relative to conventional GPUs, with feasible edge deployment on 28nm CMOS. The work also discusses manufacturing economics, security advantages against model extraction, and a practical FPGA validation, and it outlines limitations and future hybrid architectures that retain some adaptability. Overall, ITA offers a principled, hardware-centered path to efficient, secure edge inference for stable LLM deployments, at a clear cost in programmability and update speed.
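
The three weight-embedding ideas named above (Canonical Signed Digit encoding, shift-add trees, and zero-weight pruning) can be illustrated with a minimal software sketch. This is not the paper's hardware generator: the function names `to_csd`, `shift_add_multiply`, and `adder_terms` are hypothetical, integer weights are assumed, and the loop merely models what a fixed shift-add tree would compute for one constant weight.

```python
def to_csd(n):
    """Return the canonical signed-digit form of integer n, least significant digit first.

    Each digit is -1, 0, or +1, and no two adjacent digits are both non-zero.
    """
    digits = []
    while n != 0:
        if n & 1:
            # n mod 4 == 1 gives digit +1; n mod 4 == 3 gives digit -1,
            # which guarantees that the next digit is 0.
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def shift_add_multiply(x, weight):
    """Multiply x by a fixed integer weight using only shifts, adds, and subtracts."""
    acc = 0
    for shift, d in enumerate(to_csd(weight)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc


def adder_terms(weight):
    """Count non-zero CSD digits, i.e. the shifted terms a fixed weight would need."""
    return sum(1 for d in to_csd(weight) if d != 0)


# 7 is 111 in binary (three terms) but 8 - 1 in CSD (two terms).
print(to_csd(7), adder_terms(7), bin(7).count("1"))
# A zero weight has no digits at all, so it contributes no logic (pruned).
print(to_csd(0), adder_terms(0))
```

In this model a weight of zero produces an empty digit list, which is the software analogue of pruning the corresponding multiplier from the circuit entirely.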

Abstract

The deployment of Large Language Models (LLMs) on consumer edge devices is throttled by the "Memory Wall" -- the prohibitive bandwidth and energy cost of fetching gigabytes of model weights from DRAM for every token generated. Current architectures (GPUs, NPUs) treat model weights as mutable software data, incurring massive energy penalties to maintain general-purpose programmability. We propose The Immutable Tensor Architecture (ITA), a paradigm shift that treats model weights not as data, but as physical circuit topology. By encoding parameters directly into the metal interconnects and logic of mature-node ASICs (28nm/40nm), ITA eliminates the memory hierarchy entirely. We present a "Split-Brain" system design where a host CPU manages dynamic KV-cache operations while the ITA ASIC acts as a stateless, ROM-embedded dataflow engine.
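
The Split-Brain division of labor can be sketched in a few lines, following the description in the Figure 1 caption: the host owns the KV-cache and computes attention, the device applies fixed linear projections, and only activation vectors cross the interface. Everything below is an illustrative assumption rather than the paper's protocol: the class names `ImmutableDevice` and `Host`, the single attention head, the weight names `q`, `k`, `v`, `o`, and the tiny identity weights are all hypothetical.

```python
import math


class ImmutableDevice:
    """Stand-in for the ITA device: weights are frozen at construction, no other state."""

    def __init__(self, weights):
        # Tuples of tuples mimic weights that cannot change after fabrication.
        self._w = {name: tuple(tuple(row) for row in m) for name, m in weights.items()}

    def project(self, name, x):
        # Stateless linear projection y = W x; only activations go in and out.
        return [sum(w * xi for w, xi in zip(row, x)) for row in self._w[name]]


class Host:
    """Stand-in for the host CPU: owns the KV-cache and computes attention."""

    def __init__(self, device):
        self.device = device
        self.k_cache = []  # grows by one entry per token, lives in host memory
        self.v_cache = []

    def step(self, x):
        # The device computes the projections; the host stores the K and V results.
        q = self.device.project("q", x)
        self.k_cache.append(self.device.project("k", x))
        self.v_cache.append(self.device.project("v", x))
        # Scaled dot-product attention over the cache, computed on the host.
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in self.k_cache]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        ctx = [sum(p * v[i] for p, v in zip(probs, self.v_cache)) for i in range(len(q))]
        # The output projection is again a fixed-weight operation on the device.
        return self.device.project("o", ctx)


identity = [[1.0, 0.0], [0.0, 1.0]]
device = ImmutableDevice({"q": identity, "k": identity, "v": identity, "o": identity})
host = Host(device)
out1 = host.step([1.0, 0.0])
out2 = host.step([0.0, 1.0])
print(out1, out2, len(host.k_cache))
```

The point of the sketch is the ownership boundary: `ImmutableDevice` never sees the cache, and `Host` never sees the weights, so per-token traffic between the two consists only of activation vectors.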

Paper Structure

This paper contains 50 sections, 5 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Split-Brain Architecture. The host manages dynamic KV-cache in system RAM and computes attention. The ITA device contains static weights as physical logic and computes linear projections. Only activation vectors traverse the host-device interface (PCIe, Thunderbolt, or USB).
  • Figure 2: Energy breakdown per parameter operation. ITA eliminates the dominant DRAM fetch cost (red), achieving 50× improvement vs. INT8 GPU baseline.
  • Figure 3: Economic barrier to model extraction. ITA raises the cost floor from $1K (software tools) to $50K+ (specialized equipment and expertise).