Token Sequence Compression for Efficient Multimodal Computing

Yasmine Omri, Parth Shroff, Thierry Tambe

TL;DR

The paper addresses the high computational burden of visual tokens in Large Multimodal Models, where self-attention costs scale quadratically with token count ($O(T^2)$). It systematically benchmarks visual token reduction approaches, introducing a simple cluster-based token aggregation method that operates in the pre-LLM embedding space and often outperforms prior finetuning-free state-of-the-art techniques. Key findings show that attention-based saliency is unreliable and volatile, while importance-agnostic and cluster-based methods can achieve strong accuracy with far fewer tokens, revealing substantial redundancy in visual encoding. The work provides practical insights toward scalable, energy-efficient multimodal computing and guides future encoding and compression research for LMMs.
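The quadratic scaling can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper name `attention_cost_ratio` and the token counts in the usage note are assumptions, not values from the paper. It counts only the pairwise $T^2$ attention interactions and ignores every other cost in the model.

```python
def attention_cost_ratio(text_tokens: int, visual_tokens: int, keep_fraction: float) -> float:
    """Ratio of pairwise attention interactions after vs. before visual token reduction.

    Hypothetical helper: models only the O(T^2) self-attention term.
    """
    kept = round(visual_tokens * keep_fraction)
    before = (text_tokens + visual_tokens) ** 2  # T^2 with all visual tokens
    after = (text_tokens + kept) ** 2            # T^2 with the reduced visual sequence
    return after / before


if __name__ == "__main__":
    # Assumed example sizes: 64 text tokens and 576 visual tokens, keeping 10%.
    print(f"{attention_cost_ratio(64, 576, 0.10):.3f}")
```

Because the text tokens are untouched, the saving on the attention term depends on how much of the sequence is visual: the more the visual tokens dominate, the closer the ratio gets to `keep_fraction ** 2`.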

Abstract

The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning, but at significant computational cost. In this work, we focus on visual language models. We highlight the redundancy and inefficiency in current vision encoders and seek to construct an adaptive compression method for multimodal data. We characterize a panoply of visual token selection and merging approaches through both benchmarking and qualitative analysis. In particular, we demonstrate that simple cluster-level token aggregation outperforms prior state-of-the-art works in token selection and merging, including merging at the vision encoder level and attention-based approaches. We underline the redundancy in current vision encoders and shed light on several puzzling trends regarding the principles of visual token selection through cross-modal attention visualizations. This work is a first effort towards more effective encoding and processing of high-dimensional data, and paves the way for more scalable and sustainable multimodal systems.
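As a rough illustration of what cluster-level token aggregation can look like, the sketch below groups visual token embeddings with plain k-means and replaces each cluster by its mean. This is a generic stand-in under stated assumptions, not the authors' algorithm: the function name `cluster_aggregate`, the choice of k-means, the Euclidean distance, and the iteration count are all hypothetical choices made for the example.

```python
import numpy as np


def cluster_aggregate(tokens, num_clusters: int, num_iters: int = 10, seed: int = 0) -> np.ndarray:
    """Reduce (T, D) token embeddings to (num_clusters, D) by averaging k-means clusters.

    Hypothetical sketch: generic k-means, not the paper's exact procedure.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen distinct tokens.
    init = rng.choice(len(tokens), size=num_clusters, replace=False)
    centroids = tokens[init].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every token to every centroid: shape (T, K).
        dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = tokens[labels == k]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                centroids[k] = members.mean(axis=0)
    # Each row is one aggregated token that stands in for its whole cluster.
    return centroids


if __name__ == "__main__":
    # Assumed toy sizes: 576 random "visual tokens" of width 32, reduced to 64.
    demo = np.random.default_rng(1).normal(size=(576, 32))
    print(cluster_aggregate(demo, num_clusters=64).shape)
```

In a pipeline of the kind described here, such a step would sit between the vision encoder and the LLM, so the language model only ever sees the shorter aggregated sequence.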

Paper Structure

This paper contains 6 sections, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Architecture of auto-regressive visual language models.
  • Figure 2: Computational savings estimates at the LLM level from retaining 10% of visual tokens, using LLM-Viewer [yuan2024llm].
  • Figure 3: Pipeline for dynamic, training-free visual token sequence reduction.
  • Figure 4: Basic cross-modality saliency visualization.
  • Figure 5: (a) Performance of LLaVA1.5-7B with 11% of the salient visual tokens. (b) Layer-wise visualization of saliency heatmaps.
  • ...and 1 more figure