RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices
Wonkyo Choe, Yangfeng Ji, Felix Xiaozhu Lin
TL;DR
The paper tackles the challenge of deploying LLMs on resource-constrained devices by deeply compressing RWKV, an RNN-like LLM. It introduces RWKV-Lite, which combines SVD-based projection compression, sparsity-aware FFN loading, embedding caches, and hierarchical heads with ARM NEON-accelerated inference. Across edge devices, RWKV-Lite reduces memory by $3.4\times$–$5\times$ (up to $10\times$ with INT8 quantization) with negligible perplexity/accuracy loss and minor impact on tokens per second (TPS), and it is more memory-efficient than Transformer models of similar accuracy. These contributions make on-device LLM inference practical, narrowing the gap between edge devices and cloud-hosted models while maintaining competitive performance.
Abstract
In the effort to deploy LLMs on resource-constrained platforms such as mobile robots and smartphones, non-transformer LLMs have achieved major breakthroughs. Recently, a novel RNN-based LLM family, Receptance Weighted Key Value (RWKV), has shown strong computational efficiency; nevertheless, RWKV models still have high parameter counts, which limits their deployment. In this paper, we propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture. Combined, our techniques reduce the memory footprint of RWKV models by 3.4x to 5x with only negligible degradation in accuracy; compared to transformer LLMs with similar accuracy, our models require 4x less memory.
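
To make the SVD-based projection compression named in the TL;DR more concrete, the following is a minimal sketch of the general idea, not the authors' implementation: a square projection matrix is replaced by two thin factors obtained from a truncated SVD. The function name `low_rank_factorize`, the matrix size, the rank, and the synthetic weight matrix are all assumptions chosen for illustration.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (out_dim x in_dim) as A @ B using a truncated SVD.

    A has shape (out_dim, rank) and B has shape (rank, in_dim).
    Hypothetical helper for illustration only.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # scale each kept column by its singular value
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
d, r = 256, 32  # assumed sizes, not taken from the paper

# Synthetic weight: rank-32 structure plus small noise (an assumption;
# real projection weights are only approximately low rank).
W = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
W += 0.01 * rng.standard_normal((d, d))

A, B = low_rank_factorize(W, rank=r)

# Parameter count drops from d*d to 2*d*r.
compression = W.size / (A.size + B.size)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"compression: {compression:.1f}x, relative error: {rel_err:.4f}")
```

At inference time, `x @ W.T` would be replaced by `(x @ B.T) @ A.T`, so only the two thin factors need to be kept in memory. How much accuracy survives depends on how close the real weights are to low rank, which this synthetic example does not measure.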
