Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
Xinlei Yu, Chengming Xu, Guibin Zhang, Yongbo He, Zhangquan Chen, Zhucun Xue, Jiangning Zhang, Yue Liao, Xiaobin Hu, Yu-Gang Jiang, Shuicheng Yan
TL;DR
This work addresses multi-agent visual hallucination snowballing in VLM-powered MAS, where early visual misinterpretations propagate through textual flows to downstream agents. It traces the cause to a degradation of visual attention across turns and identifies unimodal vision tokens in middle layers as key carriers of visual evidence. The authors propose ViF, a plug-and-play method that relays visual information via a selected set of unimodal tokens and reinforces this flow through attention reallocation, with an optional Key-Norm token selection alternative when attention scores are inaccessible. Empirical results across eight benchmarks and ten base models show consistent improvements (2.4–3.8% on average; >4% on harder benchmarks and large models) and a substantial reduction in hallucination snowballing as quantified by HS. Overall, ViF makes inter-agent communication in MAS more reliable by preserving visual evidence and limiting the propagation of hallucinations, enabling more robust collaborative reasoning in visual-language contexts.
Abstract
Multi-Agent Systems (MAS) powered by Visual Language Models (VLMs) enable challenging tasks but suffer from a novel failure mode, multi-agent visual hallucination snowballing, in which hallucinations are seeded in a single agent and amplified by subsequent ones because of over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we trace hallucination snowballing to a reduction in visual attention allocation. This leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in visual hallucination snowballing in MAS. We therefore propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. Experimental results demonstrate that our method markedly reduces hallucination snowballing, consistently improving performance across eight benchmarks, based on four common MAS structures and ten base models. The source code is publicly available at: https://github.com/YU-deep/ViF.git.
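To make the token-selection idea above concrete, the following is a minimal sketch of one way to pick vision tokens whose layer-wise attention profile has a single peak in the middle layers. It is not the authors' implementation (see the linked repository for that). The function names `is_unimodal` and `select_relay_tokens`, the `attn[layer][token]` layout, the `mid_range` bounds, and the rise-then-fall test for "unimodal" are all assumptions introduced here for illustration.

```python
def is_unimodal(profile, tol=0.0):
    """Return True if the profile rises to a single peak and then falls.

    `profile` is a list of per-layer attention values for one token.
    `tol` (hypothetical) allows small violations of monotonicity.
    """
    peak = max(range(len(profile)), key=profile.__getitem__)
    rising = all(profile[i] <= profile[i + 1] + tol for i in range(peak))
    falling = all(
        profile[i] + tol >= profile[i + 1]
        for i in range(peak, len(profile) - 1)
    )
    return rising and falling


def select_relay_tokens(attn, mid_range, tol=0.0):
    """Pick indices of tokens with a unimodal peak inside `mid_range`.

    `attn[layer][token]` is assumed to hold the attention mass that a
    vision token receives at a given layer. `mid_range` is an inclusive
    (low, high) pair of layer indices treated as "middle layers".
    """
    num_layers = len(attn)
    num_tokens = len(attn[0])
    low, high = mid_range
    selected = []
    for t in range(num_tokens):
        profile = [attn[layer][t] for layer in range(num_layers)]
        peak = max(range(num_layers), key=profile.__getitem__)
        if low <= peak <= high and is_unimodal(profile, tol):
            selected.append(t)
    return selected


# Toy data: 5 layers x 3 vision tokens (made-up values).
attn = [
    [0.1, 0.5, 0.1],
    [0.2, 0.4, 0.4],
    [0.5, 0.3, 0.1],
    [0.2, 0.2, 0.4],
    [0.1, 0.1, 0.1],
]
relay = select_relay_tokens(attn, mid_range=(1, 3))
print(relay)
```

In this toy input, token 0 peaks once at layer 2 and is kept, token 1 decays monotonically from the first layer, and token 2 has two peaks, so only token 0 is selected. How the selected tokens are then relayed between agents and how attention is reallocated toward them is specific to ViF and is not modeled in this sketch.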
