Looking Back and Forth: Cross-Image Attention Calibration and Attentive Preference Learning for Multi-Image Hallucination Mitigation

Xiaochen Yang; Hao Fang; Jiawei Kong; Yaoxin Mao; Bin Chen; Shu-Tao Xia

TL;DR

CAPL (Cross-Image Attention calibration and Preference Learning) is a structured hallucination mitigation framework. It explicitly enhances inter-image interactions at the architectural level and reinforces reliance on genuine cross-image evidence during training, improving the model's perception and modeling of cross-image associations.

Abstract

Although large vision-language models (LVLMs) have demonstrated remarkable capabilities, they are prone to hallucinations in multi-image tasks. We attribute this issue to limitations in existing attention mechanisms and insufficient cross-image modeling. Motivated by this observation, we propose a structured hallucination mitigation framework involving Cross-Image Attention calibration and Preference Learning (CAPL). CAPL explicitly enhances inter-image interactions at the architectural level while reinforcing reliance on genuine cross-image evidence during training, thereby improving the model's perception and modeling of cross-image associations. Specifically, we (i) introduce a selectable image token interaction attention mechanism to establish fine-grained cross-image entity alignment and information flow, and (ii) design a cross-image modeling-based preference optimization strategy that contrasts reasoning outcomes under full inter-image interaction with those obtained when images are mutually invisible, encouraging the model to ground its predictions in authentic visual evidence and mitigating erroneous inferences driven by textual priors. Experimental results demonstrate that CAPL consistently improves performance across multiple model architectures, achieving stable gains on both multi-image hallucination and general benchmarks. Notably, performance on single-image visual tasks remains stable or slightly improves, indicating strong generalization capability.
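
The preference optimization strategy in (ii) can be illustrated with a minimal sketch. It assumes the standard DPO objective plus a length-normalized NLL term on the preferred response, weighted by a ratio `lam`. The exact loss used by CAPL is not stated on this page, so the function name `dpo_with_nll`, the default `beta`, and the way `lam` enters the sum are assumptions for illustration only. The chosen response is taken to be the one produced under full inter-image interaction, and the rejected response the one produced when images are mutually invisible.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_with_nll(pol_chosen: float, pol_rejected: float,
                 ref_chosen: float, ref_rejected: float,
                 num_chosen_tokens: int,
                 beta: float = 0.1, lam: float = 0.0) -> float:
    """Standard DPO loss plus an assumed NLL regularizer.

    All four log-probabilities are summed over response tokens.
    pol_* come from the policy being trained, ref_* from a frozen reference.
    """
    # Implicit reward margin between chosen and rejected responses.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    dpo = -log_sigmoid(beta * margin)
    # Assumption: NLL is averaged over tokens of the chosen response.
    nll = -pol_chosen / num_chosen_tokens
    return dpo + lam * nll
```

With `lam=0` this reduces to plain DPO; a larger margin in favor of the chosen response lowers the loss.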

Paper Structure (14 sections, 14 equations, 4 figures, 4 tables)

Introduction
Related Work
Large Vision-Language Models
Multimodal Hallucination in LVLMs
Method
Preliminary
Selective Cross-Image Token Interaction
Cross-Image Attention Guided DPO for Preference Learning
Training Optimization
Experiments
Experimental Setup
Main Results
Ablation Studies
Conclusion

Figures (4)

Figure 1: Overview of our training framework and evaluation. (a) Attention Calibration: Reactivates cross-image token attention on top of causal attention. (b) Attentive Preference Contrast: Generates positive and negative attention for preference learning via inter-image reconnection and truncation. (c) Multi-Image Task: Illustrates a consistency question; incorrect answers ignore inter-image distinctions while correct ones leverage key inter-image information. (d) Evaluation: Our framework performs well on multi-image hallucination and general tasks, while preserving single-image capabilities.
Figure 2: The overview of the CAPL framework. The pipeline consists of attention modification and preference learning. Image tokens are ranked to select key tokens, and the original causal mask is modified into enhanced cross-image attention (reactivating inter-image token interactions) and truncated cross-image attention (blocking inter-image interactions). The enhanced mask is used for inference and positive sample construction, while the truncated mask generates negative samples for DPO training.
Figure 3: Comparison of three models’ accuracies on two types of negative samples: those generated by the original model structure (Original Negatives) and those generated using the truncated attention structure (Truncated Negatives). Lower accuracy indicates that the negative samples are more challenging.
Figure 4: Ablation results with respect to the Select Ratio $\rho$ and the NLL Loss Ratio $\lambda$.
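
The two attention masks described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the segment encoding (image index per token, `-1` for text), the externally supplied ranking `scores`, the per-image top-`ceil(rho * n)` selection, and the helper names are all assumptions. Masks are boolean matrices where `mask[i][j]` means query token `i` may attend to key token `j`.

```python
import math


def causal_mask(n):
    """Lower-triangular mask: token i sees tokens 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def select_key_tokens(segments, scores, rho):
    """Per image, keep the top ceil(rho * count) positions by score."""
    keys = set()
    for img in sorted({s for s in segments if s >= 0}):
        pos = [i for i, s in enumerate(segments) if s == img]
        k = math.ceil(rho * len(pos))
        pos.sort(key=lambda i: scores[i], reverse=True)
        keys.update(pos[:k])
    return keys


def _cross_image(segments, i, j):
    """True when i and j are image tokens from different images."""
    return segments[i] >= 0 and segments[j] >= 0 and segments[i] != segments[j]


def enhanced_mask(segments, keys):
    """Causal mask plus attention to selected key tokens of other images."""
    n = len(segments)
    mask = causal_mask(n)
    for i in range(n):
        for j in range(n):
            if _cross_image(segments, i, j) and j in keys:
                mask[i][j] = True  # reopen a link the causal mask may hide
    return mask


def truncated_mask(segments):
    """Causal mask with every inter-image link blocked."""
    n = len(segments)
    mask = causal_mask(n)
    for i in range(n):
        for j in range(n):
            if _cross_image(segments, i, j):
                mask[i][j] = False  # images become mutually invisible
    return mask
```

For example, with `segments = [0, 0, 1, 1, -1]` (two images of two tokens each, then one text token), the enhanced mask lets tokens of the first image attend forward to the key token of the second image, while the truncated mask stops the second image from seeing the first. Text tokens keep ordinary causal attention in both cases under this sketch.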
