UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

Ziyao Wang, Chen Chen, Jingtao Li, Weiming Zhuang, Jiabo Huang, Ang Li, Lingjuan Lyu

Abstract

Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, facilitating shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens such models require introduces substantial computation and memory overhead, which directly hinders deployment in resource-constrained scenarios such as embodied AI systems. In this work, we propose UniCompress, a unified token compression algorithm that significantly reduces the visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided by learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces the number of image tokens by up to 4×, achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, demonstrating the promise of token-efficient unified modeling for real-world multimodal applications.
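
To give a rough sense of why a 4× reduction in visual tokens matters, the back-of-the-envelope sketch below compares sequence lengths and causal self-attention work before and after compression. It is an illustration only, not the paper's method or measurements: the token counts (256 visual, 32 text) are hypothetical, the helper names `compressed_length` and `attention_pairs` are invented here, and any extra tokens the method may add (such as global meta tokens) are ignored.

```python
def compressed_length(num_visual_tokens: int, ratio: int) -> int:
    """Visual tokens remaining after compressing by `ratio` (ceiling division)."""
    return -(-num_visual_tokens // ratio)


def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by one causal self-attention layer."""
    return seq_len * (seq_len + 1) // 2


# Hypothetical sizes chosen for illustration; they are not taken from the paper.
num_visual, num_text, ratio = 256, 32, 4

full_len = num_visual + num_text
short_len = compressed_length(num_visual, ratio) + num_text

print("sequence length:", full_len, "->", short_len)
print("attention-pair reduction:",
      round(attention_pairs(full_len) / attention_pairs(short_len), 2))
```

Under these assumed sizes the sequence shrinks from 288 to 96 tokens. Because attention cost grows roughly quadratically with sequence length, the attention work falls by more than the 3× reduction in sequence length, although the end-to-end speedup of a real model also depends on its feed-forward layers and on the cost of the compression modules themselves.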

Paper Structure

This paper contains 25 sections, 8 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: We propose UniCompress, a plug-and-play token compression algorithm for unified models. The samples are from UniTok (Ma et al., 2025).
  • Figure 2: Overview of UniCompress. The tokenizer is augmented with three modules: a global token extractor, a token compressor, and an autoregressive decompressor. The language model consumes a compact visual sequence for understanding and produces compressed-domain targets for generation.
  • Figure 3: Understanding task examples: generating text that describes the image.
  • Figure 4: Ablation on global token type. Results use $N_g{=}4$. Our global meta token yields competitive understanding and notably stronger generation quality (lower FID, higher CLIP).
  • Figure 5: UniCompress preserves the most visual information under compression by using global meta tokens and an autoregressive decompressor.
  • ...and 2 more figures