CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, Ray Zhang

TL;DR

CrossLMM introduces a dual cross-attention framework to decouple long video sequences from LMMs, achieving substantial visual token compression with minimal performance loss. A frame-wise visual encoder, a visual-language projector, and a decoder-only LLM with V2V and T2V cross-attention enable efficient, fine-grained multimodal fusion. Extensive experiments show CrossLMM maintains strong video-understanding performance using far fewer tokens per frame and demonstrates favorable memory, compute, and latency characteristics. The approach offers a practical path to deploying long-video LMMs in resource-constrained settings, with careful consideration of data and ethical implications.

Abstract

The advent of Large Multimodal Models (LMMs) has significantly enhanced the ability of Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens grows significantly, leading to quadratically growing computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, which decouples long video sequences from LMMs via a dual cross-attention mechanism and substantially reduces the number of visual tokens with minimal performance degradation. Specifically, we first apply a significant token reduction to the output of pretrained visual encoders through pooling. Then, within the LLM layers, we employ a visual-to-visual cross-attention mechanism, in which the pooled visual tokens act as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained information. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks while using substantially fewer computational resources.
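The pipeline described in the abstract (pool the visual tokens, then let the pooled tokens and the text tokens each query the original visual tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a single attention head, no learned projection weights, a plain residual connection, 1D average pooling over contiguous token groups, and attention over the original tokens of all frames at once. The names `pool_tokens` and `cross_attention` and all shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_tokens(frame_tokens, out_tokens):
    """Average-pool (frames, tokens, dim) down to out_tokens per frame.

    Assumption: contiguous 1D groups of tokens are averaged; the paper's
    pooling layout may differ.
    """
    f, t, d = frame_tokens.shape
    assert t % out_tokens == 0, "tokens per frame must divide evenly"
    return frame_tokens.reshape(f, out_tokens, t // out_tokens, d).mean(axis=2)

def cross_attention(queries, keys_values):
    """Single-head cross-attention with a residual add and no learned weights."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (num_queries, num_keys)
    return queries + softmax(scores) @ keys_values  # (num_queries, dim)

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
frames, tokens_per_frame, dim = 4, 16, 8
original = rng.standard_normal((frames, tokens_per_frame, dim))
text = rng.standard_normal((5, dim))

pooled = pool_tokens(original, out_tokens=1).reshape(-1, dim)  # (4, 8)
original_flat = original.reshape(-1, dim)                      # (64, 8)

# V2V: pooled visual tokens query the original visual tokens.
visual_enhanced = cross_attention(pooled, original_flat)
# T2V: text tokens query the original visual tokens.
text_enhanced = cross_attention(text, original_flat)

# Only the short sequence (pooled visual + text) would enter LLM self-attention.
llm_input = np.concatenate([visual_enhanced, text_enhanced], axis=0)
```

In this sketch the LLM-facing sequence has 9 tokens instead of 69, while the 64 original visual tokens are touched only as keys and values in the two cross-attention calls.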

Paper Structure

This paper contains 30 sections, 8 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Comparison of Different Visual Token Compression Methods. (a) keeps all visual tokens. (b) and (c) merge visual tokens before and within LLMs, respectively. (d) Our method decouples visual tokens from LLMs with a dual cross-attention mechanism. We first merge visual tokens before the LLM to reduce the computational cost inside it. We then propose a Visual-to-Visual Cross-Attention (V2V CA) to carry fine-grained details from the original visual tokens into the merged tokens, and a Text-to-Visual Cross-Attention (T2V CA) to enhance text tokens with visual information.
  • Figure 2: Architecture of CrossLMM, which consists of a visual encoder, a visual projector, and an LLM. Given a pretrained LLM, we insert the proposed Dual Cross-Attention Layer (DCAL) into it every $n$ layers. The DCAL is a variant of the general cross-attention layer with two parallel blocks: Visual-to-Visual (V2V) Cross-Attention and Text-to-Visual (T2V) Cross-Attention. Both V2V Cross-Attn and T2V Cross-Attn aggregate fine-grained information from the original visual tokens to produce visual-enhanced video tokens and text tokens.
  • Figure 3: Implementation Details of Dual Cross-Attention. For a detailed illustration, please refer to Sec. \ref{vcross} and Sec. \ref{tcross}.
  • Figure 4: Efficiency comparison between LLaVA-OV and CrossLMM across different frame counts (32, 64, 128, and 256). (a) CUDA memory consumption measured in MB, showing CrossLMM's significantly lower memory footprint that scales more efficiently with increasing frames. (b) Computational complexity measured in TFLOPs, demonstrating CrossLMM's reduced computational requirements. (c) Prefill processing time measured in milliseconds, illustrating CrossLMM's faster processing capability. (d) Average performance improvement of CrossLMM over LLaVA-OV across all frame counts, showing substantial reductions in all metrics.
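
The captions for Figures 2 and 4 suggest why inserting a DCAL only every $n$ layers can be cheaper than feeding all visual tokens to the LLM. The sketch below is a rough count of attention score pairs, not a reproduction of the paper's measurements. It assumes "every $n$ layers" means after layers $n$, $2n$, and so on, ignores causal masking, MLP cost, and the visual encoder, and lets every query attend to the original tokens of all frames. The function names and the example sizes are hypothetical.

```python
def dcal_layer_indices(num_layers, n):
    """0-based indices of the layers after which a DCAL is assumed to sit."""
    return [i for i in range(num_layers) if (i + 1) % n == 0]

def attention_pairs(frames, tokens_per_frame, pooled_per_frame,
                    text_tokens, num_layers, n):
    """Return (baseline, decoupled) counts of query-key score pairs."""
    original_visual = frames * tokens_per_frame

    # Baseline: every layer runs self-attention over all visual and text tokens.
    full_seq = original_visual + text_tokens
    baseline = num_layers * full_seq ** 2

    # Decoupled: self-attention sees only pooled visual tokens plus text.
    short_seq = frames * pooled_per_frame + text_tokens
    self_attn = num_layers * short_seq ** 2

    # Each DCAL lets the short sequence (V2V and T2V queries together)
    # attend to the original visual tokens, which is linear in their count.
    num_dcal = len(dcal_layer_indices(num_layers, n))
    cross_attn = num_dcal * short_seq * original_visual

    return baseline, self_attn + cross_attn

# Hypothetical toy sizes, not values from the paper.
baseline, decoupled = attention_pairs(
    frames=8, tokens_per_frame=16, pooled_per_frame=1,
    text_tokens=8, num_layers=4, n=2,
)
print(dcal_layer_indices(4, 2), baseline, decoupled)
```

With these toy sizes the count drops from 73984 to 5120 score pairs. The point of the sketch is the scaling: the quadratic term depends only on the short sequence, while the original visual tokens contribute a term that grows linearly with the number of frames.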