SHARe-KAN: Holographic Vector Quantization for Memory-Bound Inference
Jeff Smith
TL;DR
This work identifies a memory bottleneck in Kolmogorov-Arnold Networks (KANs) used for vision and shows that these networks have a holographic topology: information is distributed across the interference of spline bases rather than localized to individual edges, so pruning fails catastrophically. It introduces SHARe-KAN, a post-training compression framework that uses Gain-Shape-Bias vector quantization to share a layer-wide codebook across edges, combined with LUTHAM, a hardware-aware memory planner and zero-copy runtime. On PASCAL VOC, the authors report an 88× reduction in runtime memory (1.13 GB to 12.91 MB) with negligible in-domain accuracy loss. Profiling on NVIDIA Ampere GPUs shows over 90% L2 cache residency, which effectively decouples computation from DRAM bandwidth. These results establish a practical path for deploying expressive spline-based networks in memory-bound scenarios and point to future directions in universal codebooks and scalable mixtures of experts for edge-efficient AI.
Abstract
Kolmogorov-Arnold Networks (KANs) face a fundamental memory wall: their learned basis functions inflate parameter counts and impose extreme bandwidth demands, hindering deployment in memory-constrained environments. We show that Vision KANs exhibit a holographic topology, where information is distributed across the interference of splines rather than localized to specific edges. Consequently, traditional pruning fails: 10% sparsity degrades mAP from 85.23% to 45%, a $\sim$40-point drop. To address this, we present SHARe-KAN, a framework that uses Gain-Shape-Bias Vector Quantization to exploit functional redundancy while preserving the dense topology. Coupled with LUTHAM, a hardware-aware compiler with static memory planning, we achieve an $88\times$ runtime memory reduction (1.13 GB $\to$ 12.91 MB) and match uncompressed baseline accuracy on PASCAL VOC. Profiling on the NVIDIA Ampere architecture confirms $>90\%$ L2 cache residency, demonstrating that the workload is decoupled from the DRAM bandwidth constraints inherent to spline-based architectures.
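As an illustration only, the sketch below shows one plausible reading of Gain-Shape-Bias vector quantization applied to the per-edge spline coefficients of a single layer. The abstract does not specify the decomposition, so everything here is an assumption: the bias is taken to be the mean of each coefficient vector, the gain its centered L2 norm, and the shape the resulting unit-norm residual, with the shared layer-wide codebook fit by plain k-means. The function names (`gsb_decompose`, `fit_codebook`, `reconstruct`) are hypothetical and are not part of SHARe-KAN or LUTHAM.

```python
import numpy as np


def gsb_decompose(W):
    """Split each row of W (one edge's spline coefficients) into gain, shape, bias.

    Assumed decomposition: W[i] = gain[i] * shape[i] + bias[i].
    """
    bias = W.mean(axis=1, keepdims=True)             # per-edge offset
    centered = W - bias
    gain = np.linalg.norm(centered, axis=1, keepdims=True)
    safe_gain = np.where(gain > 0, gain, 1.0)        # avoid dividing by zero
    shape = centered / safe_gain                     # zero-mean, unit-norm rows
    return gain, shape, bias


def assign(shapes, codebook):
    """Index of the nearest codeword (squared Euclidean distance) for each shape."""
    dists = ((shapes[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def fit_codebook(shapes, k, iters=20, seed=0):
    """Plain k-means over shape vectors; a stand-in for the paper's codebook training."""
    rng = np.random.default_rng(seed)
    codebook = shapes[rng.choice(len(shapes), size=k, replace=False)].copy()
    for _ in range(iters):
        idx = assign(shapes, codebook)
        for j in range(k):
            members = shapes[idx == j]
            if len(members):                         # leave empty clusters unchanged
                codebook[j] = members.mean(axis=0)   # centroids are not re-normalized
    return codebook, assign(shapes, codebook)


def reconstruct(gain, idx, bias, codebook):
    """Rebuild approximate coefficients from one index, one gain, one bias per edge."""
    return gain * codebook[idx] + bias


# Toy layer (synthetic data, not from the paper): 256 edges, 8 coefficients each.
rng = np.random.default_rng(42)
W = rng.normal(size=(256, 8))
gain, shape, bias = gsb_decompose(W)
codebook, idx = fit_codebook(shape, k=16)
W_hat = reconstruct(gain, idx, bias, codebook)

rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
dense_floats = W.size
shared_floats = codebook.size + gain.size + bias.size  # plus one small index per edge
print(f"relative error: {rel_err:.3f}")
print(f"floats stored: {dense_floats} dense vs {shared_floats} with a shared codebook")
```

Under these assumptions every edge keeps its own index, gain, and bias, so no edge is removed and the dense topology is preserved; only the shape vectors are shared across the layer. Random Gaussian coefficients have little functional redundancy, so the toy error is not indicative of the accuracy reported on PASCAL VOC.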
