Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting

Jiarui Wu, Zhuo Liu, Hangfeng He

TL;DR

This paper tackles the problem of hallucinations in multimodal spatial relation reasoning by LVLMs. It proposes constraint-aware prompting with two mechanisms: bidirectional constraints (BA/AB consistency) and transitivity constraints (AC/BC to constrain AB via a reference object C), implemented in a zero-shot prompting framework. Across three spatial-relation datasets, the approach significantly improves accuracy and F1 scores, with the combined constraint yielding the strongest performance and demonstrating generalization to other LVLMs. The work highlights practical benefits for reducing spatial reasoning hallucinations and suggests directions for automatic reference selection and broader spatial relation tasks.

Abstract

Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), leading them to generate incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) a bidirectional constraint, which ensures consistency in pairwise object relations, and (2) a transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely used spatial relation datasets, demonstrating performance improvements over existing approaches. Additionally, a systematic analysis of bidirectional relation analysis choices and transitivity reference selections highlights the further potential of our method for incorporating constraints to mitigate spatial relation hallucinations.
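
The logic behind the two constraints can be illustrated symbolically. The sketch below is not the paper's implementation, which expresses the constraints through zero-shot prompts; it only shows what each constraint requires of a set of predicted relations. The relation vocabulary and the function names (`converse`, `bidirectional_consistent`, `infer_ab`) are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical relation vocabulary; the datasets used in the paper may differ.
CONVERSE = {
    "left of": "right of",
    "right of": "left of",
    "above": "below",
    "below": "above",
    "in front of": "behind",
    "behind": "in front of",
}


def converse(relation: str) -> str:
    """Return the converse relation, e.g. 'left of' -> 'right of'."""
    return CONVERSE[relation]


def bidirectional_consistent(ab: str, ba: str) -> bool:
    """Bidirectional constraint: the relation of B to A must be the
    converse of the relation of A to B."""
    return ba == converse(ab)


def infer_ab(ac: str, bc: str) -> Optional[str]:
    """Transitivity constraint via a reference object C.

    If A and B lie on opposite sides of C along the same axis
    (for example, A is left of C and B is right of C), the relation of
    A to B is implied. Otherwise the reference gives no constraint
    and None is returned.
    """
    if bc == converse(ac):
        return ac
    return None


if __name__ == "__main__":
    print(bidirectional_consistent("left of", "right of"))
    print(infer_ab("left of", "right of"))
    print(infer_ab("left of", "left of"))
```

In this toy form, a prediction for AB that contradicts either check is the kind of output the constraints are meant to rule out; a reference object that sits on the same side of both A and B yields no constraint, which suggests why the choice of reference matters.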

Paper Structure

This paper contains 26 sections, 16 figures, 12 tables.

Figures (16)

  • Figure 1: Comparison between the vanilla prompt and the prompt incorporating constraint awareness (transitivity constraint). Constraint-aware content is highlighted in blue, incorrect content in red, and correct content in green. In the right image, the relations highlighted in blue correct the incorrect relation highlighted in red.
  • Figure 2: Template prompt skeleton. Prompting techniques are highlighted in blue. The phrase inside {} summarizes omitted details, and $O_1$ and $O_2$ represent the labels of the objects.
  • Figure 3: An example showing how candidate objects in the question are labeled, along with the corresponding spatial relations in the AB, BA, AB+BA, and BA+AB orders. "Cat" is labeled "A" because it appears earlier than "rabbit" in the question.
  • Figure 4: Accuracy comparison of different relation analysis choices in the bidirectional and combined constraints. BA + AB is the order used in our proposed approach. BA and AB + BA are variants of our method: BA analyzes only the converse relation, while AB + BA analyzes the direct relation first, followed by the converse relation. AB, which analyzes only the direct relation, is not considered a bidirectional constraint because the converse relation is not examined. For the F1 score diagram and detailed data, refer to the appendix.
  • Figure 5: F1 score comparison of different relation analysis choices in the bidirectional and combined constraints.
  • ...and 11 more figures
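
The labeling rule and analysis orders described in the captions of Figures 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`label_objects`, `analysis_steps`) and the example question are hypothetical, and objects are matched by simple substring search.

```python
import string
from typing import Dict, List, Tuple


def label_objects(question: str, objects: List[str]) -> Dict[str, str]:
    """Assign labels A, B, ... to objects in the order they first
    appear in the question (the earlier mention becomes 'A')."""
    lowered = question.lower()
    positions = {}
    for obj in objects:
        index = lowered.find(obj.lower())
        if index < 0:
            raise ValueError(f"object {obj!r} not found in question")
        positions[obj] = index
    ordered = sorted(objects, key=lambda obj: positions[obj])
    return dict(zip(string.ascii_uppercase, ordered))


def analysis_steps(order: str) -> List[Tuple[str, str]]:
    """Expand an order such as 'BA + AB' into (subject, object) label
    pairs, in the sequence in which the relations are analyzed."""
    parts = order.replace(" ", "").split("+")
    return [(part[0], part[1]) for part in parts]


if __name__ == "__main__":
    # Hypothetical question, used only for illustration.
    labels = label_objects(
        "Is the cat to the left of the rabbit?", ["rabbit", "cat"]
    )
    for subject, obj in analysis_steps("BA + AB"):
        print(f"Analyze the relation of {labels[subject]} to {labels[obj]}")
```

With the order BA + AB, the converse relation (rabbit to cat) is analyzed before the direct relation (cat to rabbit); AB + BA reverses the sequence, and AB or BA alone analyzes a single direction.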