SHARe-KAN: Holographic Vector Quantization for Memory-Bound Inference

Jeff Smith

TL;DR

This work identifies a memory bottleneck in Kolmogorov-Arnold Networks (KANs) used for vision, revealing a holographic topology where information is distributed across spline bases and pruning becomes catastrophically ineffective. It introduces SHARe-KAN, a post-training compression framework that uses Gain-Shape-Bias vector quantization to share a layer-wide codebook across edges, combined with LUTHAM, a hardware-aware memory planner and zero-copy runtime. The authors demonstrate an 88× reduction in runtime memory (to 12.9 MB) on PASCAL VOC with negligible in-domain accuracy loss, and show cache-resident operation on NVIDIA Ampere GPUs with over 90% L2 residency, effectively decoupling computation from DRAM bandwidth. These results establish a practical path for deploying expressive spline-based networks in memory-bound scenarios and point to future directions in universal codebooks and scalable mixtures of experts for edge-efficient AI.

Abstract

Kolmogorov-Arnold Networks (KANs) face a fundamental memory wall: their learned basis functions create parameter counts that impose extreme bandwidth demands, hindering deployment in memory-constrained environments. We show that Vision KANs exhibit a holographic topology, where information is distributed across the interference of splines rather than localized to specific edges. Consequently, traditional pruning fails (10% sparsity degrades mAP from 85.23% to 45%, a $\sim$40-point drop). To address this, we present SHARe-KAN, a framework utilizing Gain-Shape-Bias Vector Quantization to exploit functional redundancy while preserving the dense topology. Coupled with LUTHAM, a hardware-aware compiler with static memory planning, we achieve $88\times$ runtime memory reduction (1.13 GB $\to$ 12.91 MB) and match uncompressed baseline accuracy on PASCAL VOC. Profiling on NVIDIA Ampere architecture confirms $>90\%$ L2 cache residency, demonstrating that the workload is decoupled from DRAM bandwidth constraints inherent to spline-based architectures.
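
As an illustration of the idea, here is a minimal sketch of a gain-shape-bias split for one edge's spline coefficient vector. The abstract does not define the decomposition, so the choices below are assumptions: the bias is the mean of the coefficients, the gain is the L2 norm of the centered vector, the shape is the unit-norm remainder, and the codebook lookup uses squared Euclidean distance. The names `gsb_decompose`, `encode`, `decode`, and `TOY_CODEBOOK` are hypothetical and do not come from the paper.

```python
import math

# Hypothetical toy codebook of zero-mean, unit-norm shapes (an assumption,
# not the paper's learned codebook).
R = 1.0 / math.sqrt(2.0)
TOY_CODEBOOK = [[R, -R], [-R, R]]


def gsb_decompose(coeffs):
    """Split a coefficient vector into (gain, shape, bias).

    Assumed convention: bias = mean, gain = L2 norm of the centered
    vector, shape = centered vector divided by gain.
    """
    bias = sum(coeffs) / len(coeffs)
    centered = [c - bias for c in coeffs]
    gain = math.sqrt(sum(c * c for c in centered))
    if gain == 0.0:
        # Constant vector: no shape information, only a bias.
        return 0.0, [0.0] * len(coeffs), bias
    return gain, [c / gain for c in centered], bias


def nearest_codeword(shape, codebook):
    """Return the index of the codebook entry closest to `shape`."""
    def sq_dist(entry):
        return sum((s - e) ** 2 for s, e in zip(shape, entry))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def encode(coeffs, codebook):
    """Encode one edge as (codebook index, scalar gain, scalar bias)."""
    gain, shape, bias = gsb_decompose(coeffs)
    return nearest_codeword(shape, codebook), gain, bias


def decode(index, gain, bias, codebook):
    """Reconstruct coefficients from the shared shape plus per-edge scalars."""
    return [gain * s + bias for s in codebook[index]]


if __name__ == "__main__":
    idx, gain, bias = encode([5.0, 1.0], TOY_CODEBOOK)
    print(idx, decode(idx, gain, bias, TOY_CODEBOOK))
```

Under these assumptions each edge stores only an index and two scalars, while the shapes themselves are shared across the layer. No edge is removed, which is how a scheme of this kind can exploit redundancy while keeping the topology dense.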

Paper Structure

This paper contains 51 sections, 8 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The pruning cliff. Vision KANs suffer catastrophic performance collapse under magnitude-based pruning, contrasting with the gradual degradation of standard MLPs, indicating information is distributed rather than localized.
  • Figure 2: Compression vs. Accuracy Trade-off. SHARe-KAN (Int8) achieves competitive accuracy with 17$\times$ smaller model size than Dense KAN, approaching ResNet-50 MLP performance in a 12.91 MB footprint.
  • Figure 3: VQ Saturation. Reconstruction quality ($R^2$) reaches saturation at $K=65{,}536$, justifying 16-bit index allocation.
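
The link between the codebook size in Figure 3 and the 16-bit index width is simple arithmetic: addressing $K$ entries takes $\lceil \log_2 K \rceil$ bits. The short check below uses a hypothetical helper name, `index_bits`, and assumes one fixed-width index per edge.

```python
def index_bits(codebook_size: int) -> int:
    """Smallest number of bits that can address `codebook_size` entries."""
    if codebook_size < 1:
        raise ValueError("codebook_size must be positive")
    # (K - 1).bit_length() equals ceil(log2(K)) for K >= 1, without floats.
    return (codebook_size - 1).bit_length()


if __name__ == "__main__":
    print(index_bits(65_536))  # codebook size reported in Figure 3
```

One more codebook entry than 65,536 would push the index to 17 bits, so $K = 65{,}536$ is the largest codebook that fits a 16-bit index exactly.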