L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression

Junxuan Zhang, Zhengxue Cheng, Yan Zhao, Shihao Wang, Dajiang Zhou, Guo Lu, Li Song

TL;DR

L3TC tackles the practicality gap in learned text compression by combining a low-complexity RWKV backbone with an outlier-aware tokenizer and a high-rank reparameterization strategy. The method achieves 48% bit savings versus gzip, uses roughly 50x fewer parameters than other learned compressors at comparable compression performance, and decodes in real time at speeds up to megabytes per second on device. It demonstrates that careful tokenizer design and training-time capacity augmentation can preserve compression performance without increasing inference cost. The work offers a practical route to deploying learned lossless compressors in real-world settings, balancing efficiency, speed, and compression quality.
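
To make the bit-saving figure concrete, the sketch below estimates how many bits an ideal entropy coder would spend when driven by a predictive model, and computes a saving relative to a baseline. The adaptive byte-frequency model is a stand-in assumption (L3TC uses an RWKV network instead). The helper names `ideal_code_length_bits` and `bit_saving` are hypothetical, and reading "bit saving" as one minus the ratio of compressed sizes is an assumed definition rather than one taken from the paper.

```python
import math
import zlib


def ideal_code_length_bits(data: bytes) -> float:
    """Bits an ideal entropy coder would need under an adaptive order-0 byte model.

    Stand-in for a learned predictor: each byte costs -log2 p(byte | counts so far).
    """
    counts = [1] * 256  # Laplace smoothing so no byte ever has probability zero
    total = 256
    bits = 0.0
    for byte in data:
        bits += -math.log2(counts[byte] / total)
        counts[byte] += 1  # update after coding, so a decoder can mirror the model
        total += 1
    return bits


def bit_saving(baseline_bits: float, method_bits: float) -> float:
    """Fraction of bits saved relative to a baseline (assumed definition)."""
    return 1.0 - method_bits / baseline_bits


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog " * 50
    deflate_bits = 8 * len(zlib.compress(text, 9))  # DEFLATE, the algorithm behind gzip
    model_bits = ideal_code_length_bits(text)
    print(f"raw: {8 * len(text)} bits")
    print(f"deflate: {deflate_bits} bits")
    print(f"order-0 model: {model_bits:.0f} bits")
    print(f"saving vs deflate: {bit_saving(deflate_bits, model_bits):.2%}")
```

A model this weak will often lose to gzip, giving a negative saving. The point is only the accounting: better predictions mean higher probabilities for the symbols that actually occur, and therefore fewer bits.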

Abstract

Learning-based probabilistic models can be combined with an entropy coder for data compression. However, due to the high complexity of learning-based models, their practical application as text compressors has been largely overlooked. To address this issue, our work focuses on a low-complexity design while maintaining compression performance. We introduce a novel Learned Lossless Low-complexity Text Compression method (L3TC). First, we conduct extensive experiments demonstrating that RWKV models achieve the fastest decoding speed with a moderate compression ratio, making them the most suitable backbone for our method. Second, we propose an outlier-aware tokenizer that uses a limited vocabulary to cover frequent tokens while allowing outliers to bypass prediction and encoding. Third, we propose a novel high-rank reparameterization strategy that enhances learning capability during training without increasing complexity during inference. Experimental results validate that our method achieves 48% bit savings compared to the gzip compressor. In addition, L3TC offers compression performance comparable to other learned compressors with a 50x reduction in model parameters. More importantly, L3TC is the fastest among all learned compressors, providing real-time decoding speeds up to megabytes per second. Our code is available at https://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression.git.
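
The outlier bypass described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the tokenizer from the paper: it splits on whitespace, keeps only the most frequent words in a small vocabulary, and routes everything else to a side list that would be stored without prediction. The names `build_vocab`, `split_outliers`, `merge_outliers`, and the reserved `ESCAPE_ID` are hypothetical, and whether the real system marks outlier positions this way is an assumption.

```python
from collections import Counter

ESCAPE_ID = 0  # hypothetical reserved id marking "an outlier goes here"


def build_vocab(words, size):
    """Map the `size` most frequent words to ids 1..size."""
    frequent = [word for word, _ in Counter(words).most_common(size)]
    return {word: index + 1 for index, word in enumerate(frequent)}


def split_outliers(words, vocab):
    """Return (ids for the predictor, raw outliers that bypass it)."""
    ids, outliers = [], []
    for word in words:
        if word in vocab:
            ids.append(vocab[word])
        else:
            ids.append(ESCAPE_ID)   # keep the position so decoding stays lossless
            outliers.append(word)   # stored as-is, never predicted
    return ids, outliers


def merge_outliers(ids, outliers, vocab):
    """Invert split_outliers: rebuild the original word sequence."""
    inverse = {index: word for word, index in vocab.items()}
    remaining = iter(outliers)
    return [next(remaining) if i == ESCAPE_ID else inverse[i] for i in ids]


if __name__ == "__main__":
    words = "the cat and the dog and the zebra".split()
    vocab = build_vocab(words, size=2)
    ids, outliers = split_outliers(words, vocab)
    print(ids, outliers)
    assert merge_outliers(ids, outliers, vocab) == words
```

Because the escape marker records where each outlier belongs, the round trip is lossless while the predictive model only ever sees the small vocabulary.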

Paper Structure

This paper contains 28 sections, 4 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Compression Ratio vs. Model Size: Notable compressors, including gzip and learning-based models, are compared. L3TC achieves the best compression ratio among them and reports a $50\times$ model size reduction with comparable compression performance. When running on devices, other learned models typically decode at KB/s speeds, while L3TC achieves decoding speeds up to MB/s.
  • Figure 2: Overall architecture of our proposed Learned Lossless Low-Complexity Text Compression (L3TC), which consists of three primary components: tokenization, prediction, and encoding. The text is first segmented by a novel outlier-aware tokenizer. Tokens in the vocabulary are then predicted by a low-complexity RWKV model and subsequently encoded by an arithmetic coder. Outliers, which appear infrequently, are allowed to bypass prediction and encoding. Additionally, a high-rank reparameterization strategy is introduced to enhance the RWKV model's prediction capability during training without increasing inference complexity.
  • Figure 3: Performance with different coverage values.
  • Figure 4: Proposed high-rank reparameterization method.
  • Figure 5: Discussion on the outlier-aware tokenizer.
  • ...and 2 more figures
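
The reparameterization idea named in the captions of Figures 2 and 4 can be illustrated generically: extra linear branches are trained alongside a main weight matrix and then folded into it, so inference runs a single matrix multiply. The sketch below assumes a simple form, y = Wx + B(Ax), with no nonlinearity inside the branch. That form is not confirmed by this summary, the paper's actual branch structure may differ, and all function names are hypothetical. Pure Python lists are used to keep the example dependency-free.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in left]


def matadd(left, right):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(left, right)]


def train_forward(W, B, A, x):
    """Training-time layer (assumed form): main path plus an auxiliary branch."""
    main = matvec(W, x)
    branch = matvec(B, matvec(A, x))
    return [m + b for m, b in zip(main, branch)]


def merge_branches(W, B, A):
    """Fold the branch into one matrix; valid only because the branch is linear."""
    return matadd(W, matmul(B, A))


if __name__ == "__main__":
    W = [[1, 2], [3, 4]]
    A = [[1, 0], [0, 1], [1, 1]]  # inner width 3 exceeds layer width 2,
    B = [[1, 0, 2], [0, 1, 1]]    # so the product B @ A is not rank-limited
    x = [1, -1]
    merged = merge_branches(W, B, A)
    assert matvec(merged, x) == train_forward(W, B, A, x)
    print(merged)
```

After merging, the inference-time layer has the same shape and cost as a plain `W`, which is the sense in which added training capacity does not increase inference complexity.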