Identifying and Mitigating Position Bias of Multi-image Vision-Language Models

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Jing Zhang

TL;DR

This work identifies a pronounced position bias in multi-image large vision-language models (LVLMs), where the order of images heavily influences predictions. It introduces Position-wise Question Answering (PQA) to quantify reasoning ability at each image position, and identifies inter-image causal attention as the core driver of this bias. A simple, training-free remedy, SoFt Attention (SoFA), linearly interpolates between inter-image causal and bidirectional attention to smooth positional effects. SoFA is applied every two layers, and a small validation set is used to select the tilt parameter. Across multiple benchmarks and tasks, SoFA reduces position bias and yields consistent, modest gains in overall reasoning performance, including in long-context scenarios. The results suggest SoFA is a practical, low-cost way to make LVLMs more robust in real-world multi-image applications.

Abstract

The evolution of Large Vision-Language Models (LVLMs) has progressed from single to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by the alteration of image positions. To further explore this issue, we introduce Position-wise Question Answering (PQA), a meticulously designed task to quantify reasoning capabilities at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel in reasoning with images positioned later but underperform with those in the middle or at the beginning, while proprietary models show improved comprehension for images at the beginning and end but struggle with those in the middle. Motivated by this, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias by employing linear interpolation between inter-image causal attention and bidirectional counterparts. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.
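
The interpolation described in the abstract can be illustrated with a minimal sketch. It assumes that the two attention variants are represented as 0/1 masks of the same shape and that the mix is controlled by a single coefficient in [0, 1]. The function name `soft_mask` and the parameter name `sigma` are hypothetical, and how the interpolated mask enters the attention computation inside a real LVLM is not specified here.

```python
from typing import List

Matrix = List[List[float]]


def soft_mask(causal: Matrix, bidirectional: Matrix, sigma: float) -> Matrix:
    """Linearly interpolate two attention masks of identical shape.

    sigma = 0 returns the causal mask, sigma = 1 the bidirectional one.
    The name `sigma` is an assumption made for this sketch.
    """
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    return [
        [(1.0 - sigma) * c + sigma * b for c, b in zip(c_row, b_row)]
        for c_row, b_row in zip(causal, bidirectional)
    ]


# Toy case: two images, one token each (rows are queries, columns are keys).
causal = [[1.0, 0.0],
          [1.0, 1.0]]          # image 2 sees image 1, not the reverse
bidirectional = [[1.0, 1.0],
                 [1.0, 1.0]]   # both images see each other

mixed = soft_mask(causal, bidirectional, sigma=0.25)
```

In this toy case only the entry that differs between the two masks changes: the first image's view of the second moves from 0 to `sigma`, while all other entries stay at 1.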

Paper Structure

This paper contains 13 sections, 2 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 2: The results of multiple evaluations while shuffling image positions. We record the minimum, maximum, and average accuracy (left vertical axis), along with the prediction inconsistency between the best and worst-performing evaluations (right vertical axis).
  • Figure 3: The results on the PQA task. We report position-wise accuracy in four scenarios where the number of images is 5, 10, 15, and 20, respectively. A higher accuracy signifies strong reasoning at the specified position, whereas a lower accuracy indicates positions where the model reasons poorly.
  • Figure 4: The three inter-image attention mechanisms, where $I_{1}$ and $I_{2}$ represent the tokens of two images, and $T_{1}$ and $T_{2}$ represent the tokens of two text segments. Note that each image typically involves numerous tokens, e.g., 576 for LLaVA; for clarity, we simplify each image to a single token here. In (A), images interact in a unidirectional manner: $I_{2}$ can attend to $I_{1}$, while $I_{1}$ remains isolated. In (B), each image is isolated, meaning it can only attend to itself. In (C), bidirectional interaction is enabled so that each image can attend to any other image. Note that we only alter the inter-image attention while preserving causal attention between the text segments.
  • Figure 5: The position-wise accuracy of the three attention mechanisms on the PQA task, alongside our proposed SoFA method.
  • Figure 6: The attention distribution across positions on PQA.
  • ...and 1 more figures
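
The three inter-image attention patterns in the Figure 4 caption can be written out as token-level masks. The following is a minimal sketch, not the authors' implementation: the function name `build_mask`, the mode names, and the segment encoding are hypothetical. It also assumes that attention within a single image and between images and text stays causal, since the caption only says that inter-image attention is altered.

```python
from typing import List, Tuple


def build_mask(segments: List[Tuple[str, int]], mode: str) -> List[List[int]]:
    """Return mask[q][k] = 1 if query token q may attend to key token k.

    segments: ordered (kind, length) pairs, kind is "image" or "text".
    mode: "causal" (A), "isolated" (B), or "bidirectional" (C).
    """
    if mode not in ("causal", "isolated", "bidirectional"):
        raise ValueError(f"unknown mode: {mode}")
    # Expand segments into one (segment_id, kind) entry per token.
    tokens = [
        (seg_id, kind)
        for seg_id, (kind, length) in enumerate(segments)
        for _ in range(length)
    ]
    n = len(tokens)
    mask = [[0] * n for _ in range(n)]
    for q, (q_seg, q_kind) in enumerate(tokens):
        for k, (k_seg, k_kind) in enumerate(tokens):
            # True only for pairs of tokens from two different images.
            inter_image = (
                q_kind == "image" and k_kind == "image" and q_seg != k_seg
            )
            if inter_image and mode == "isolated":
                allowed = False      # (B) images never see each other
            elif inter_image and mode == "bidirectional":
                allowed = True       # (C) images see each other both ways
            else:
                allowed = k <= q     # everything else stays causal
            mask[q][k] = int(allowed)
    return mask


# Layout from the caption, one token per segment: I1, T1, I2, T2.
layout = [("image", 1), ("text", 1), ("image", 1), ("text", 1)]
mask_a = build_mask(layout, "causal")
mask_b = build_mask(layout, "isolated")
mask_c = build_mask(layout, "bidirectional")
```

With this layout, the masks differ only in the two entries linking $I_{1}$ and $I_{2}$: (A) allows $I_{2}$ to attend to $I_{1}$, (B) allows neither direction, and (C) allows both.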