DualComp: End-to-End Learning of a Unified Dual-Modality Lossless Compressor

Yan Zhao, Zhengxue Cheng, Junxuan Zhang, Qunshan Gu, Qi Wang, Li Song

TL;DR

DualComp introduces a unified, lightweight dual-modality lossless compressor for image and text built on the RWKV-7 backbone. It uses modality-unified tokenization, modality-switching contextual learning, and modality-routing MoE, together with a reparameterization training strategy, to achieve near real-time CPU inference and state-of-the-art or competitive compression with far fewer parameters than LLM-based equivalents. The approach delivers Kodak results around $2.57$ bits/Byte (DualComp-I) and enwik9 results around $1.107$ bits/Byte with substantially reduced compute, while maintaining cross-modality consistency and efficient parameter utilization. These contributions offer practical, scalable solutions for multi-modal data compression and pave the way for broader multi-modal, low-latency lossless codecs in edge and desktop environments.
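To put the bits/Byte figures quoted above in more familiar terms, the short sketch below converts them into compression ratios and an approximate compressed size. It assumes only that a raw byte is 8 bits and that enwik9 is $10^9$ bytes long; the function names are illustrative and do not come from the paper.

```python
def compression_ratio(bits_per_byte):
    """Original size divided by compressed size (a raw byte is 8 bits)."""
    return 8.0 / bits_per_byte


def compressed_bytes(original_bytes, bits_per_byte):
    """Ideal compressed size in bytes at a given bits/Byte rate."""
    return original_bytes * bits_per_byte / 8.0


# Rates quoted in the TL;DR (Kodak for DualComp-I, enwik9 for text).
kodak_ratio = compression_ratio(2.57)
enwik9_ratio = compression_ratio(1.107)

# Assumption: enwik9 is 10**9 bytes of raw text.
enwik9_size = compressed_bytes(10**9, 1.107)

print(f"{kodak_ratio:.2f}x, {enwik9_ratio:.2f}x, {enwik9_size / 1e6:.1f} MB")
```

Under these assumptions, 2.57 bits/Byte is roughly a 3.1x reduction on Kodak, and 1.107 bits/Byte is roughly a 7.2x reduction on enwik9, or about 138 MB of output.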

Abstract

Most learning-based lossless compressors are designed for a single modality, requiring separate models for multi-modal data and lacking flexibility. However, different modalities vary significantly in format and statistical properties, making it ineffective to use compressors that lack modality-specific adaptations. While multi-modal large language models (MLLMs) offer a potential solution for modality-unified compression, their excessive complexity hinders practical deployment. To address these challenges, we focus on the two most common modalities, image and text, and propose DualComp, the first unified and lightweight learning-based dual-modality lossless compressor. Built on a lightweight backbone, DualComp incorporates three key structural enhancements to handle modality heterogeneity: modality-unified tokenization, modality-switching contextual learning, and modality-routing mixture-of-experts. A reparameterization training strategy is also used to boost compression performance. DualComp integrates both modality-specific and shared parameters for efficient parameter utilization, enabling near real-time inference (200 KB/s) on desktop CPUs. With far fewer parameters, DualComp achieves compression performance on par with SOTA LLM-based methods on both text and image datasets. Its simplified single-modality variant surpasses the previous best image compressor on the Kodak dataset by about 9% using just 1.2% of the model size.

Paper Structure

This paper contains 37 sections, 4 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Left: Most existing lossless compressors support only a single modality, whereas DualComp enables dual-modality compression in one model. Right: Lossless compression performance (bits/Byte) on image (Kodak) and text (enwik9) datasets. DualComp matches or surpasses SOTA methods with fewer parameters on both image and text.
  • Figure 2: DualComp tokenizes text and image inputs with a unified vocabulary, then encodes them into a compressed bitstream via context probabilities and arithmetic coding. Built on a lightweight backbone, it further incorporates modality-switching context learning and a modality-routing mixture-of-experts for efficient dual-modality compression.
  • Figure 3: Dual-modality tokenization: images are patched and scanned into 1D sequences, with each subpixel as a token. Text is tokenized using an SPM-BPE tokenizer. The two modalities share a unified vocabulary of 16K size.
  • Figure 4: Percent of expert usage in each batch when compressing image (left) and text (right) using DualComp-16M.
  • Figure 5: Left: Learning-based methods' image compression performance (bits/Byte vs. model size). The closer to the bottom-left corner, the smaller the model and the better the performance. Right: Dual-modality compression consistency. The x-axis and y-axis are bits/Byte on text and image datasets, respectively. $*$ marks pretrained LLMs.
  • ...and 6 more figures
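
The captions for Figures 2 and 3 describe two steps that a small example can make concrete: turning an image into a 1D sequence of subpixel tokens, and measuring the coding cost implied by the model's context probabilities. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The patch size, the raster scan order within and across patches, the use of raw subpixel values as token IDs, and both function names are assumptions made for this example. The cost function reports the ideal code length that an arithmetic coder approaches, not an actual bitstream.

```python
import math


def image_to_tokens(image, patch=2):
    """Flatten an H x W x C image (nested lists of 0..255 ints) into tokens.

    Assumed scheme: split into patch x patch tiles, visit tiles and the
    pixels inside each tile in raster order, and emit one token per
    subpixel (channel value).
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    tokens.extend(image[y][x])  # channel values in stored order
    return tokens


def ideal_bits_per_byte(token_probs, num_bytes):
    """Sum of -log2 p(token | context) divided by the original byte count."""
    total_bits = sum(-math.log2(p) for p in token_probs)
    return total_bits / num_bytes


# Toy 2 x 4 RGB image: pixel (y, x) holds [10 * y + x, 100, 200].
image = [[[10 * y + x, 100, 200] for x in range(4)] for y in range(2)]
tokens = image_to_tokens(image, patch=2)

# A uniform model over 256 subpixel values stands in for a learned model.
# It assigns p = 1/256 to every token, so nothing is compressed.
uniform_probs = [1 / 256] * len(tokens)
bpb = ideal_bits_per_byte(uniform_probs, num_bytes=len(tokens))
print(len(tokens), bpb)
```

With the uniform stand-in model, the cost is 8 bits/Byte, which is the no-compression baseline. A learned model lowers the figure by assigning higher probability to the subpixel that actually occurs next.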