The Immutable Tensor Architecture: A Pure Dataflow Approach for Secure, Energy-Efficient AI Inference
Fang Li
TL;DR
The paper introduces the Immutable Tensor Architecture (ITA), a dataflow ASIC design with no memory hierarchy that encodes neural network weights directly into hardware. ITA treats weights as fixed circuit topology and uses a Split-Brain protocol in which the host manages the dynamic KV-cache while the device computes with the fixed weights, yielding large energy and area savings. Key innovations include logic-embedded weights via Canonical Signed Digit (CSD) encoding, shift-add trees, and zero-weight pruning, giving an estimated 50× gain in energy efficiency and up to a 4.85× reduction in MAC gate count over conventional GPUs, with feasible edge deployment on 28nm CMOS. The work also discusses manufacturing economics, security advantages against model extraction, and a practical FPGA validation, and it outlines limitations and future hybrid architectures that would retain some adaptability. Overall, ITA offers a principled, hardware-centered path to efficient, secure edge inference for stable LLM deployments, with clear trade-offs in programmability and update speed.
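The CSD encoding, shift-add, and zero-weight pruning ideas named above can be illustrated in software. The following is a minimal Python sketch, not the paper's hardware design: it assumes integer (quantized) weights, and the function names `to_csd`, `shift_add_multiply`, and `fixed_dot` are hypothetical. It shows how a constant weight becomes a short list of signed shifts, so that multiplying by it needs only adds and subtracts, and how a zero weight contributes no operations at all.

```python
def to_csd(n: int) -> list[int]:
    """Return the Canonical Signed Digit form of n, least significant digit first.

    Each digit is -1, 0, or +1, and no two adjacent digits are both nonzero.
    """
    digits = []
    while n != 0:
        if n & 1:
            # Pick +1 when n mod 4 == 1 and -1 when n mod 4 == 3,
            # which forces the next digit to be zero.
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def shift_add_multiply(x: int, weight: int) -> int:
    """Multiply x by a constant weight using only shifts, adds, and subtracts."""
    acc = 0
    for shift, d in enumerate(to_csd(weight)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc


def fixed_dot(xs: list[int], weights: list[int]) -> int:
    """Dot product against constant weights; zero weights are skipped entirely."""
    return sum(shift_add_multiply(x, w) for x, w in zip(xs, weights) if w != 0)
```

For example, `to_csd(7)` gives `[-1, 0, 0, 1]`, meaning 7 = 8 - 1. That is two nonzero digits instead of the three ones in binary `111`, so a hard-wired multiply by 7 needs one subtraction rather than two additions.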
Abstract
The deployment of Large Language Models (LLMs) on consumer edge devices is throttled by the "Memory Wall": the prohibitive bandwidth and energy cost of fetching gigabytes of model weights from DRAM for every token generated. Current architectures (GPUs, NPUs) treat model weights as mutable software data, incurring massive energy penalties to maintain general-purpose programmability. We propose The Immutable Tensor Architecture (ITA), a paradigm shift that treats model weights not as data, but as physical circuit topology. By encoding parameters directly into the metal interconnects and logic of mature-node ASICs (28nm/40nm), ITA eliminates the memory hierarchy entirely. We present a "Split-Brain" system design where a host CPU manages dynamic KV-cache operations while the ITA ASIC acts as a stateless, ROM-embedded dataflow engine.
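The "Split-Brain" division of labor can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the paper's protocol: the weight matrices `W_Q`, `W_K`, `W_V`, the function `device_project`, and the class `HostKVCache` are all hypothetical. The sketch also assumes the host computes attention over the cache, whereas the abstract only says the host manages KV-cache operations. The point it shows is that the device side holds no state between tokens, while all growing state lives on the host.

```python
import math

# Hypothetical constants standing in for weights fixed in the ASIC at fabrication.
W_Q = ((1, 0), (0, 1))
W_K = ((1, 0), (0, 1))
W_V = ((2, 0), (0, 2))


def matvec(w, x):
    """Multiply a fixed matrix by an input vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def device_project(x):
    """Stateless device step: the output depends only on x and the fixed weights."""
    return matvec(W_Q, x), matvec(W_K, x), matvec(W_V, x)


class HostKVCache:
    """Host-side state: the cache grows by one key/value pair per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        # Ask the device for projections, then keep the dynamic state on the host.
        q, k, v = device_project(x)
        self.keys.append(k)
        self.values.append(v)

        # Scaled dot-product attention over every cached token (assumed host-side).
        scale = math.sqrt(len(q))
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / scale for key in self.keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        return [sum(p * val[i] for p, val in zip(probs, self.values))
                for i in range(len(v))]
```

Calling `HostKVCache().step([1.0, 0.0])` on an empty cache attends only to the token itself, so the result is just that token's value projection. Later calls extend `keys` and `values` without the device-side function ever changing.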
