Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi

TL;DR

This work addresses vision-language misalignment in decoder-only Multimodal LLMs by redesigning the attention mechanism. It introduces modality-mutual attention (MMA) within AKI, which unlocks cross-modal information flow by modifying the LLM’s attention mask so that image tokens can attend to text tokens, without adding parameters or training time. In experiments across 12 benchmarks, MMA (and the AKI-4B variant) outperforms state-of-the-art baselines and dual-order training (DOT) variants, and the design is intended to extend to other modality pairs. The approach is architecture-centric, data-efficient, and openly released to spur further exploration of cross-modal interaction in multimodal foundation models.

Abstract

Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are built on decoder-only LLMs that use a causal attention mechanism, which limits the ability of the earlier modalities (e.g., images) to incorporate information from the later modalities (e.g., text). To address this problem, we propose AKI, a novel MLLM that unlocks causal attention into modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows AKI to achieve superior performance in 12 multimodal understanding benchmarks (+7.2% on average) without introducing additional parameters or increasing training time. Our MMA design is intended to be generic, allowing for application across various modalities, and scalable to accommodate diverse multimodal scenarios. The code and model are publicly available at https://github.com/sony/aki to encourage further advancements in MLLMs across various directions.
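
The difference between a causal mask and an MMA-style mask can be illustrated with a small sketch. It is not taken from the released code: the function names `causal_mask` and `mma_mask` and the per-token `is_image` flags are hypothetical, and the sketch assumes that MMA keeps the causal mask and additionally lets every image-token query attend to every text-token key, as the abstract describes. Padding and batching are ignored.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """Standard causal mask: query i may attend to key j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def mma_mask(is_image: List[bool]) -> List[List[bool]]:
    """Causal mask plus unlocked image-to-text attention (assumed MMA rule).

    is_image[k] is True when token k is an image token, False when it is
    a text token. Rows are queries and columns are keys.
    """
    n = len(is_image)
    base = causal_mask(n)
    return [
        [
            # Keep every causal entry, and additionally unlock entries where
            # an image-token query looks at a text-token key.
            base[i][j] or (is_image[i] and not is_image[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two image tokens followed by two text tokens.
    for row in mma_mask([True, True, False, False]):
        print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch only the rows belonging to image tokens change; the rows for text tokens stay strictly causal, so autoregressive text generation is unaffected. Because the change is confined to the mask, no weights are added, which matches the claim that MMA introduces no additional parameters.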

Paper Structure

This paper contains 30 sections, 7 equations, 11 figures, 15 tables.

Figures (11)

  • Figure 1: An illustration of the vision-centric scenario. The image contains ambiguous signs and is paired with an object-related query. The correct answer is that parking is allowed for 2 hours from 8am to 8pm on Saturday. While GPT-4o [gpt-4o], Molmo [DBLP:journals/corr/abs-2409-17146], and DeepSeek-VL2-Small [wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels] respond with hallucinations, our proposed AKI provides an accurate answer. The image is sourced from [parking_illustration].
  • Figure 2: The conventional framework for MLLMs (e.g., Molmo [DBLP:journals/corr/abs-2409-17146], BLIP-3 [DBLP:journals/corr/abs-2408-08872], and Cambrian [tong2024cambrian]) typically consists of a vision encoder, a vision-language (VL) connector, and a text decoder (LLM). In this framework, images are often placed before text in a sequentialized input, so the earlier modality (images) lacks access to information from the later modality (text) due to the causal attention design in decoder-only LLMs, as shown in the right part (gray squares). Notably, placing text before image tokens does not resolve this issue, as the fundamental limitation persists.
  • Figure 3: An illustration of dual-order training, where T and I indicate text and images, respectively.
  • Figure 4: The prompt template for the I&T and T&I input orders. {image patch} and {question} are replaced based on each data sample.
  • Figure 5: An illustration of our proposed modality-mutual attention (MMA), which modifies the causal attention mask in the LLM (gray squares) so that image tokens can attend to text tokens (blue squares).
  • ...and 6 more figures