
Rethinking Causal Mask Attention for Vision-Language Inference

Xiaohuan Pei, Tao Huang, YanXiang Ma, Chang Xu

TL;DR

The paper questions the direct application of left-to-right causal masking from LLMs to vision-language inference, arguing that strict future masking can misalign with visual processing. It introduces three future-aware masks—$M^f$, $M^{v2v}$, and $M^{v2t}$—and a lightweight kernel-pooling prefill-merge to exploit future visual semantics while preserving autoregressive decoding. Empirical results across diverse multimodal tasks show that relaxing or previewing future visual context improves temporal, visual-relational, and text-rich reasoning, though fully exposing future context increases decoding latency unless merged into a prefix during prefill. Together, these findings advocate modality-aware causal attention designs to enhance both the effectiveness and efficiency of vision-language inference.

Abstract

Causal attention has become a foundational mechanism in autoregressive vision-language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs) where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model's ability to leverage future context that often contains essential semantic cues for accurate inference. In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting. We first empirically analyze the effect of previewing future tokens for vision queries and demonstrate that rigid masking undermines the model's capacity to capture useful contextual semantic representations. Based on these findings, we propose a lightweight attention family that aggregates future visual context into past representations via pooling, effectively preserving the autoregressive structure while enhancing cross-token dependencies. We evaluate a range of causal masks across diverse vision-language inference settings and show that selectively compressing future semantic context into past representations benefits the inference.
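The abstract describes aggregating future visual context into past representations via pooling, but this summary gives no formula for it. The following is only a minimal sketch of that general idea, not the authors' method. The function name `merge_future_by_pooling`, the mean-pooling window `kernel`, and the blend weight `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def merge_future_by_pooling(x, kernel=2, alpha=0.5):
    """Blend each visual token with the mean of its next `kernel` tokens.

    x      : (n, d) array of visual token features, in sequence order.
    kernel : hypothetical pooling window over future tokens.
    alpha  : hypothetical blend weight for the pooled future summary.

    Illustrative assumption only; the paper's actual pooling operator
    is not specified in this summary.
    """
    x = np.asarray(x, dtype=float)
    out = x.copy()
    for i in range(x.shape[0]):
        future = x[i + 1 : i + 1 + kernel]  # window is shorter near the end
        if future.shape[0] > 0:
            out[i] = (1 - alpha) * x[i] + alpha * future.mean(axis=0)
    return out


features = np.array([[0.0], [2.0], [4.0]])
merged = merge_future_by_pooling(features, kernel=2, alpha=0.5)
```

Because the future summary is folded into earlier positions ahead of time, a standard causal mask could still be applied afterwards, which is the property the abstract refers to as preserving the autoregressive structure.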

Paper Structure

This paper contains 14 sections, 20 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Breaking the causal masks of LLaVA-7b on the ALFRED benchmark (Shridhar et al., 2020).
  • Figure 2: An overview of our investigation into causal attention in vision-language inference. (a) Causal mask inference: enforces strict autoregressive decoding by blocking all future attention. (b) Future-aware inference: enables visual tokens to preview future tokens in the upper-triangular region. (c) Light future-aware inference: compresses future attention into past visual positions.
  • Figure 3: An example of a temporal multi-image task: visual navigation.
  • Figure 4: An example of visual relation tasks.
  • Figure 5: An example of text-rich VQA tasks.
  • ...and 3 more figures

Theorems & Definitions (3)

  • Definition 3.1: Future-Aware Full Mask
  • Definition 3.2: Future-Aware Visual-to-Visual Mask
  • Definition 3.3: Future-Aware Visual-to-Textual Mask
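
The listing above gives only the names of the three definitions, not their formulas. As a rough illustration, the sketch below builds boolean attention masks under one plausible reading of those names and of the Figure 2 caption: visual queries are additionally allowed to attend to future positions (all of them, only visual ones, or only textual ones), while text queries stay strictly causal. This reading, the function names, and the dictionary keys are assumptions, not the paper's definitions.

```python
import numpy as np


def causal_mask(n):
    """Standard causal mask: entry (i, j) is True when query i may attend to key j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))


def future_aware_masks(is_visual):
    """Build causal and future-aware masks for a mixed token sequence.

    is_visual : sequence of bools, True where the token is a visual token.

    Assumed semantics (not taken from the paper's equations):
      "f"   : visual queries may also attend to every future token.
      "v2v" : visual queries may also attend to future visual tokens.
      "v2t" : visual queries may also attend to future text tokens.
    """
    is_visual = np.asarray(is_visual, dtype=bool)
    n = is_visual.shape[0]
    causal = causal_mask(n)
    future = ~causal                 # strictly upper-triangular region
    vis_query = is_visual[:, None]   # rows belonging to visual queries
    vis_key = is_visual[None, :]     # columns belonging to visual keys
    return {
        "causal": causal,
        "f": causal | (future & vis_query),
        "v2v": causal | (future & vis_query & vis_key),
        "v2t": causal | (future & vis_query & ~vis_key),
    }


# Toy sequence: text, visual, visual, text.
masks = future_aware_masks([False, True, True, False])
```

Under these assumptions the full mask is the union of the visual-to-visual and visual-to-textual masks, and rows belonging to text queries are identical to the ordinary causal mask.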