Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik

TL;DR

This work reframes token reduction in Transformer-based generative models from a purely efficiency-oriented technique into a core design principle that shapes architecture and training across vision, language, and multimodal systems. By formalizing token reduction as a three-part pipeline (compression criteria, reduction strategies, and end-to-end integration), the paper shows how selective pruning, merging, and distillation can address visual representation, cross-modal alignment, and long-context challenges while improving training stability. It outlines a research roadmap spanning algorithmic innovations, reinforcement-learning-guided token selection, and hardware-algorithm co-design, with emphasis on constructive compression and reasoning-aware token management. The proposed directions aim to enable scalable, robust, and interpretable generative systems for dense prediction tasks, long video understanding, and cross-domain applications in science and medicine, where token-level decisions critically impact performance and reliability.

Abstract

In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Because Transformer self-attention has quadratic computational complexity in the number of tokens, token reduction has primarily been used as an efficiency strategy. This is especially true in single-modality vision and language domains, where it helps balance computational cost, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability. Building on this reframing, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, agentic framework design, and broader ML and scientific domains.

Paper Structure

This paper contains 32 sections, 10 equations, 3 figures, 1 algorithm.

Figures (3)

  • Figure 1: Timeline of notable token reduction methods across modality shifts (Vision & Language $\to$ Multimodal LLMs & Agents). All of these strategies aim to speed up inference with negligible performance drops. In contrast, we ask: what is the next token reduction paradigm in generative model design, one that goes beyond test-time acceleration?
  • Figure 2: The token reduction pipeline. We formulate reduction as a composite of Criteria $\mathcal{E}$ (scoring) and Strategy $\mathcal{P}$ (pruning/merging).
  • Figure 3: Visualization of token reduction. (a) Image: Visual tokens are pruned based on saliency, retaining only the most salient patches. (b) Text: Low-information stop words (gray) are removed to form a compressed semantic core.
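
The composition described in the Figure 2 caption (a scoring criterion followed by a pruning or merging strategy) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the paper's method: the function names are hypothetical, the criterion is a simple L2-norm saliency proxy, and the merging rule averages all discarded tokens into one summary token.

```python
import math
from typing import Callable, List, Sequence

Token = List[float]  # one embedding vector


def l2_norm_scores(tokens: Sequence[Token]) -> List[float]:
    """Hypothetical criterion: score each token by its embedding L2 norm."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def _top_k_split(scores: Sequence[float], k: int):
    """Return (kept indices in original order, remaining indices)."""
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), ranked[k:]


def prune_top_k(tokens: Sequence[Token], scores: Sequence[float], k: int) -> List[Token]:
    """Hypothetical pruning strategy: keep the k highest-scoring tokens."""
    keep, _ = _top_k_split(scores, k)
    return [list(tokens[i]) for i in keep]


def merge_rest(tokens: Sequence[Token], scores: Sequence[float], k: int) -> List[Token]:
    """Hypothetical merging strategy: keep the top-k tokens and average
    the remaining ones into a single extra token appended at the end."""
    keep, rest = _top_k_split(scores, k)
    out = [list(tokens[i]) for i in keep]
    if rest:
        dim = len(tokens[0])
        out.append([sum(tokens[i][d] for i in rest) / len(rest) for d in range(dim)])
    return out


def reduce_tokens(
    tokens: Sequence[Token],
    criterion: Callable[[Sequence[Token]], List[float]],
    strategy: Callable[[Sequence[Token], Sequence[float], int], List[Token]],
    k: int,
) -> List[Token]:
    """Compose a criterion (scoring) with a strategy (pruning or merging)."""
    return strategy(tokens, criterion(tokens), k)


if __name__ == "__main__":
    toks = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.2], [1.0, 1.0]]
    print(reduce_tokens(toks, l2_norm_scores, prune_top_k, k=2))
    print(reduce_tokens(toks, l2_norm_scores, merge_rest, k=2))
```

Keeping the criterion and the strategy as separate callables mirrors the caption's decomposition: an attention-based or learned score could replace `l2_norm_scores` without touching the pruning or merging code.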