ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

Wenjie Liu, Hao Wu, Xin Qiu, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

TL;DR

This work identifies that dense visual processing in multimodal LLMs is largely redundant and that effective vision–language fusion occurs in a small subset of layers. It introduces ViCA, a Vision-only Cross-Attention architecture in which visual tokens are frozen after projection and interact with text only through sparse cross-attention at selected layers, preserving about 98% of baseline accuracy while reducing visual-side computation to 4%. ViCA yields over 3.5× speedup in single-batch inference and over 10× in multi-batch inference through a regular, hardware-friendly pipeline, and it remains compatible with token-pruning methods. The approach is evaluated across three MLLM backbones and nine multimodal benchmarks.

Abstract

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
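
The information flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names `attend` and `vica_forward` are hypothetical, there are no learned projections (hidden states serve directly as queries, keys, and values), and the feed-forward step is omitted. It only shows the structural idea that visual embeddings are never updated, that text tokens are updated in every layer, and that text reads the visual embeddings as keys and values only at the layers listed in `cross_layers`.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attend(queries, keys, values, prefix_len):
    """Scaled dot-product attention (toy, no learned weights).

    Query i may see the first `prefix_len` keys (the visual prefix, if any)
    plus text positions 0..i, which keeps text-to-text attention causal.
    """
    d = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        n = prefix_len + i + 1
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys[:n]]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values[:n])) for c in range(d)])
    return out

def vica_forward(visual, text, num_layers, cross_layers):
    """Hypothetical ViCA-style forward pass over lists of vectors.

    visual: projected visual embeddings, treated as fixed (assumption: they
            are used as-is as keys/values, with no per-layer projection).
    text:   text hidden states, updated in every layer.
    """
    for layer in range(num_layers):
        if layer in cross_layers:
            # Selected layer: text queries also read the fixed visual tokens.
            keys, prefix_len = visual + text, len(visual)
        else:
            # All other layers: text-only attention, zero visual computation.
            keys, prefix_len = text, 0
        attn = attend(text, keys, keys, prefix_len)
        # Residual update on text tokens only; a real layer would also apply
        # a text-only FFN here.
        text = [[h + a for h, a in zip(h_row, a_row)] for h_row, a_row in zip(text, attn)]
    return visual, text
```

With an empty `cross_layers`, the text output does not depend on the visual input at all; adding a single layer index is enough for visual information to reach the text stream, while `visual` itself is returned unchanged in both cases.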

Paper Structure (68 sections, 21 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: Comparison between self-attention and cross-attention architectures from an information-flow perspective. Attention and FFN operations for visual tokens dominate the computation in MLLMs. V2V, T2V, and T2T denote vision-to-vision, text-to-vision, and text-to-text attention, respectively.
  • Figure 2: Layer-wise diagnostics of visual-token updates in LLaVA-1.5-7B on TextVQA, showing output impact (Vis Attn/FFN KL) and representation change (Vis Attn/FFN 1-Cos).
  • Figure 3: Layer-wise diagnostics of text-to-vision (T2V) cross-attention in LLaVA-1.5-7B on TextVQA, showing output impact (T2V KL) and representation change (T2V 1-Cos).
  • Figure 4: Common token dropping vs. our minimal architecture (ViCA). Left: Token dropping removes some visual tokens, but the remaining ones still undergo full self-attention and FFN updates across layers, incurring substantial visual computation. Right: ViCA removes visual update paths in attention and FFN. Visual tokens act only as KVs in a few cross-attention layers, while all other operations run on text tokens, reducing visual computation.
  • Figure 5: Latency and speedup of forward pass on A6000 GPU with increasing batch sizes. Pre-training forward-pass latency of our model variants compared to the original baseline model (averaged over 100 samples). Speedup ratios are annotated on the stacked bars for LLaVA-1.5-3B, 7B, and 13B.
  • ...and 3 more figures
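
The captions of Figures 2 and 3 name two diagnostics, "KL" for output impact and "1-Cos" for representation change. The sketch below shows one plausible way to compute such quantities. The exact definitions are not given in this summary, so the function names, the KL direction (original output distribution versus the distribution after ablating a component), and the `eps` smoothing are all assumptions.

```python
import math

def one_minus_cos(before, after):
    """Representation change: 1 - cosine similarity between a token's
    hidden state before and after an update (0 means direction unchanged)."""
    dot = sum(a * b for a, b in zip(before, after))
    norm_b = math.sqrt(sum(a * a for a in before))
    norm_a = math.sqrt(sum(a * a for a in after))
    return 1.0 - dot / (norm_b * norm_a)

def kl_divergence(p, q, eps=1e-12):
    """Output impact: KL(p || q) between two next-token distributions,
    e.g. p from the full model and q with one component removed
    (assumed direction; eps guards against log(0))."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Example: an update that only rescales a vector leaves 1-Cos at zero,
# and identical distributions give zero KL.
same_direction = one_minus_cos([1.0, 0.0], [2.0, 0.0])
no_impact = kl_divergence([0.5, 0.5], [0.5, 0.5])
```

Under this reading, a layer whose visual update yields near-zero values on both metrics changes neither the representation nor the model output, which is the kind of redundancy the abstract refers to.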