Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting
Jiarui Wu, Zhuo Liu, Hangfeng He
TL;DR
This paper tackles hallucinations in multimodal spatial relation reasoning by large vision-language models (LVLMs). It proposes constraint-aware prompting with two mechanisms: bidirectional constraints (BA/AB consistency) and transitivity constraints (AC/BC to constrain AB via a reference object C), implemented in a zero-shot prompting framework. Across three spatial-relation datasets, the approach significantly improves accuracy and F1 scores, with the combined constraint yielding the strongest performance and demonstrating generalization to other LVLMs. The work highlights practical benefits for reducing spatial reasoning hallucinations and suggests directions for automatic reference selection and broader spatial relation tasks.
Abstract
Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), leading them to generate incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) a bidirectional constraint, which ensures consistency in pairwise object relations, and (2) a transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely used spatial relation datasets, demonstrating performance improvements over existing approaches. Additionally, a systematic analysis of various bidirectional relation analysis choices and transitivity reference selections highlights the broader potential of our method for incorporating constraints to mitigate spatial relation hallucinations.
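
The logic behind the two constraints can be illustrated with a minimal Python sketch. This is not the paper's prompting implementation, which operates through zero-shot prompts to an LVLM. It only shows the logical conditions that such prompts are meant to encourage. The relation vocabulary, the `INVERSE` table, and the function names are hypothetical and chosen for illustration.

```python
# Hypothetical relation vocabulary with inverses (an assumption, not from the paper).
INVERSE = {
    "left of": "right of",
    "right of": "left of",
    "above": "below",
    "below": "above",
}


def bidirectional_consistent(rel_ab, rel_ba):
    """Bidirectional constraint: the A-to-B relation must be the inverse of B-to-A."""
    return rel_ab in INVERSE and INVERSE[rel_ab] == rel_ba


def implied_ab(rel_ac, rel_bc):
    """Transitivity constraint via a reference object C.

    If A stands in relation r to C, and C stands in the same relation r to B
    (that is, B stands in the inverse of r to C), then A stands in r to B.
    Returns None when the two relations do not constrain the A-to-B relation.
    """
    rel_cb = INVERSE.get(rel_bc)
    if rel_ac in INVERSE and rel_ac == rel_cb:
        return rel_ac
    return None


# A is left of C and B is right of C, so A must be left of B.
print(implied_ab("left of", "right of"))
# A and B are both left of C, so their mutual order is undetermined.
print(implied_ab("left of", "left of"))
# "A left of B" and "B left of A" contradict each other.
print(bidirectional_consistent("left of", "left of"))
```

In this sketch, a prediction that violates either check would be flagged as a likely hallucination. The paper instead incorporates the constraints into the prompt so that the model reasons consistently in the first place.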
