RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices

Wonkyo Choe, Yangfeng Ji, Felix Xiaozhu Lin

TL;DR

The paper tackles the challenge of deploying LLMs on resource-constrained devices by deeply compressing RWKV, an RNN-like LLM. It introduces RWKV-Lite, combining SVD-based projection compression, sparsity-aware FFN loading, embedding caches, and hierarchical heads, plus ARM NEON-accelerated inference. Across edge devices, RWKV-Lite achieves a memory reduction of $3.4\times$–$5\times$ (up to $10\times$ with INT8 quantization) with negligible perplexity/accuracy loss and minor TPS impact, outperforming Transformer models at similar accuracy in memory efficiency. These contributions enable practical on-device LLM inference, closing the gap between edge capability and cloud-based models while maintaining competitive performance.
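
To make the memory effect of SVD-based projection compression concrete, the following is a minimal sketch of the parameter arithmetic, assuming a projection matrix of shape `d_in × d_out` is replaced by two truncated-SVD factors of shapes `d_in × rank` and `rank × d_out`. The function names, the dimension 2048, and the rank 256 are illustrative assumptions, not values taken from the paper.

```python
def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameter count of the two factors in W ≈ A @ B, with A: d_in x rank, B: rank x d_out."""
    return d_in * rank + rank * d_out


def compression_ratio(d_in: int, d_out: int, rank: int) -> float:
    """Dense parameter count divided by the factored parameter count."""
    return (d_in * d_out) / low_rank_params(d_in, d_out, rank)


def fp16_mebibytes(n_params: int) -> float:
    """Storage in MiB when every parameter takes 2 bytes (FP16)."""
    return n_params * 2 / 2**20


# Hypothetical sizes, chosen only for illustration.
D, RANK = 2048, 256
dense = D * D
factored = low_rank_params(D, D, RANK)
print(f"dense:    {dense} params, {fp16_mebibytes(dense)} MiB")
print(f"factored: {factored} params, {fp16_mebibytes(factored)} MiB")
print(f"ratio:    {compression_ratio(D, D, RANK)}x")
```

The factorization saves memory only when `rank < d_in * d_out / (d_in + d_out)`; for a square matrix that means a rank below half the dimension.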

Abstract

To deploy LLMs on resource-constrained platforms such as mobile robots and smartphones, non-transformer LLMs have achieved major breakthroughs. Recently, a novel RNN-based LLM family, Receptance Weighted Key Value (RWKV), has shown strong computational efficiency; nevertheless, RWKV models still have high parameter counts, which limits their deployment. In this paper, we propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture. Combined, our techniques reduce the memory footprint of RWKV models by 3.4x to 5x with only negligible degradation in accuracy; compared to transformer LLMs with similar accuracy, our models require 4x less memory.

Paper Structure

This paper contains 47 sections, 10 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: A proof-of-concept system we built, which runs the compressed RWKV model reported in the paper and demonstrates the concept of running LLMs on wearable devices. See Table \ref{tab:platform} for hardware details.
  • Figure 2: Simplified architecture of the RWKV model. Each variant stacks $L$ RWKV blocks, each comprising a time-mix and a channel-mix layer. Colored blocks mark our techniques applied to the original layers. (LN = Layer Normalization.)
  • Figure 3: Average FFN sparsity ratio (the fraction of zero values in activations), showing substantial sparsity across layers, which means that unused weight rows/columns are loaded into memory. Tested on 200 token generations in the channel-mix layer of the small RWKV model.
  • Figure 4: An illustration of hierarchical heads, comprising a cluster head and many per-cluster token heads. At inference time, it computes the probability of each cluster, selects the most probable clusters, and selectively loads their token heads to compute the token logits (orange/green boxes). Finally, it computes pseudo logits for tokens in unselected clusters (blue-pattern filled box).
  • Figure 5: Accuracy and memory footprint comparison between RWKV and transformer models. RWKV-ours has a smaller memory footprint than the other models and still maintains comparable accuracy under both loading strategies. All model weights are in FP16. Benchmark: lambada_openai.
  • ...and 7 more figures
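
The hierarchical-head inference flow described in the Figure 4 caption can be sketched roughly as follows. This is a toy illustration in plain Python, not the authors' implementation: the function name `hierarchical_logits`, the data layout, and the tiny weights are hypothetical. In particular, the caption does not say how pseudo logits are derived, so the rule used here (one shared floor value just below every exactly computed logit) is a placeholder assumption.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def hierarchical_logits(
    hidden: Sequence[float],
    cluster_head: Sequence[Sequence[float]],             # one weight row per cluster
    token_heads: Dict[int, Dict[int, Sequence[float]]],  # cluster id -> {token id: weight row}
    vocab_size: int,
    top_k: int = 1,
) -> Tuple[List[float], List[int]]:
    # 1. Compute the probability of each cluster from the cluster head.
    cluster_probs = softmax([dot(w, hidden) for w in cluster_head])

    # 2. Select the top_k most probable clusters.
    ranked = sorted(range(len(cluster_probs)), key=lambda c: cluster_probs[c], reverse=True)
    selected = ranked[:top_k]

    # 3. Touch only the token heads of the selected clusters ("selective loading").
    logits: List[float] = [float("nan")] * vocab_size
    seen = [False] * vocab_size
    for c in selected:
        for token_id, w in token_heads[c].items():
            logits[token_id] = dot(w, hidden)
            seen[token_id] = True

    # 4. Fill tokens of unselected clusters with a pseudo logit.
    #    ASSUMPTION: a shared floor below all computed logits; the paper's rule may differ.
    floor = min(v for v, s in zip(logits, seen) if s) - 1.0
    logits = [v if s else floor for v, s in zip(logits, seen)]
    return logits, selected


# Hypothetical toy model: 2 clusters, 4 tokens, hidden size 2.
hidden = [1.0, 0.0]
cluster_head = [[2.0, 0.0], [-2.0, 0.0]]
token_heads = {
    0: {0: [1.0, 0.0], 1: [3.0, 0.0]},
    1: {2: [5.0, 0.0], 3: [0.0, 1.0]},
}
logits, selected = hierarchical_logits(hidden, cluster_head, token_heads, vocab_size=4)
print(selected, logits)
```

In this toy setup only cluster 0 is selected, so the weight rows for tokens 2 and 3 are never read. Token 2 would have scored 5.0 under a full head, which shows the trade-off in this sketch: memory is saved at the cost of approximating logits outside the selected clusters, and raising `top_k` narrows that gap.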