DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models

Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou

TL;DR

This work identifies a bottleneck in Multimodal Large Language Models where compressive projectors induce a double abstraction of visual semantics, hindering vision-language alignment. It introduces R-GAE to analyze semantic flow and demonstrates that current compressive projectors like QFormer overly abstract patch information before LLM processing. To address this, DeCo decouples token compression from semantic abstraction by performing patch-level downsampling with a parameter-free 2D AdaptiveAvgPool, leaving semantic abstraction to the LLM. Empirical results show DeCo yields consistent performance and efficiency gains across diverse benchmarks, backbones, resolutions, and LLMs, underscoring the practical value of decoupling compression from abstraction in MLLMs.

Abstract

The visual projector, which bridges the vision and language modalities and facilitates cross-modal alignment, is a crucial component of MLLMs. However, measuring how effectively a projector aligns vision and language remains under-explored; currently it can only be inferred from the performance of MLLMs on downstream tasks. Motivated by this problem, this study examines the projector module by interpreting the vision-language semantic flow within MLLMs. Specifically, we trace the semantic relevance flow back from generated language tokens to raw visual encoder patches and to the intermediate outputs produced by projectors. Our findings reveal that compressive projectors (e.g., QFormer) abstract visual patches into a limited set of semantic concepts, such as objects or attributes, resulting in a 'double abstraction' phenomenon. This involves a first visual semantic abstraction by the projector with reference to pre-defined query tokens, and a second extraction by the LLM based on text instructions. The double abstraction is inefficient in training and results in a cumulative deficiency of visual semantics. To mitigate this issue, we propose the key insight of 'Decouple Compression from Abstraction' (DeCo), that is, compressing the number of visual tokens at the patch level with the projector and allowing the LLM to handle visual semantic abstraction entirely. Consequently, we adopt a simple compressor, i.e., 2D Adaptive Pooling, to downsample visual patches in a parameter-free manner. Empirical evaluation demonstrates that DeCo surpasses traditional compressive projectors in both performance and efficiency. It achieves performance gains of 0.9%, 7.1%, and 2.9% across the MLLM Benchmarks, Visual Localization, and Open-ended VQA tasks with fewer trainable parameters and faster convergence speed.
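
To make the patch-level compression step concrete, here is a minimal, dependency-free sketch of 2D adaptive average pooling over a grid of patch tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names `adaptive_avg_pool_2d` and `compress_tokens` are hypothetical, the patch tokens are assumed to arrive in row-major order on a square grid, and the window boundaries follow the common adaptive-pooling convention (start at floor(i * n / out), end at ceil((i + 1) * n / out)). The 576-to-64 token setting (a 24 x 24 grid pooled to 8 x 8) is taken from the Figure 6 caption.

```python
import math


def adaptive_avg_pool_2d(grid, out_h, out_w):
    """Average-pool an H x W grid of feature vectors down to out_h x out_w.

    Assumed convention: output index i along an axis of length n averages
    input indices floor(i * n / out) .. ceil((i + 1) * n / out) - 1.
    The operation has no learnable parameters.
    """
    in_h, in_w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    pooled = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h
        r1 = ((i + 1) * in_h + out_h - 1) // out_h  # integer ceiling
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = ((j + 1) * in_w + out_w - 1) // out_w  # integer ceiling
            count = (r1 - r0) * (c1 - c0)
            cell = [
                sum(grid[r][c][d] for r in range(r0, r1) for c in range(c0, c1)) / count
                for d in range(dim)
            ]
            row.append(cell)
        pooled.append(row)
    return pooled


def compress_tokens(tokens, out_side):
    """Reshape a flat, row-major token list to a square grid, pool it, and flatten."""
    side = math.isqrt(len(tokens))
    if side * side != len(tokens):
        raise ValueError("expected a square number of patch tokens")
    grid = [tokens[r * side:(r + 1) * side] for r in range(side)]
    pooled = adaptive_avg_pool_2d(grid, out_side, out_side)
    return [cell for row in pooled for cell in row]


# Toy input: 576 tokens (24 x 24 grid); each token is a 2-dim vector
# holding its own (row, col) position so the averaging is easy to inspect.
tokens = [[float(r), float(c)] for r in range(24) for c in range(24)]
compressed = compress_tokens(tokens, out_side=8)
print(len(tokens), "->", len(compressed))
```

Each output token here is the mean of a 3 x 3 block of neighbouring patches, so it remains a patch-level feature tied to a spatial location rather than a learned concept, which is the sense in which compression is kept separate from abstraction. In a real model the same operation would normally be applied to tensors with a framework primitive such as PyTorch's `torch.nn.AdaptiveAvgPool2d`.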

Paper Structure

This paper contains 29 sections, 5 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Original Images
  • Figure 2: Query-to-Patch Relevance (top image)
  • Figure 3: Query-to-Patch Relevance (bottom image)
  • Figure 5: The overall analysis framework of a typical MLLM. During image-to-text generation, we trace back the language-to-vision semantic flow utilizing R-GAE relevance maps.
  • Figure 6: Visualization of the R-GAE relevance maps across the same MLLM architecture except for projector modules. The linear projector is non-compressive while the QFormer and Adaptive Average Pooling (ours) compress the original 576 vision tokens to 64 tokens. Text-to-Patch relevance reveals the effective vision semantics aligned with the LLM during image-to-text generation. For QFormer in the second row, its Query-to-Patch map discards the fine-grained visual semantics about "purple and red". This semantic deficiency is transmitted to the final Text-to-Patch map and leads to a misalignment of vision patches and textual words.
  • ...and 7 more figures