Stitch and Tell: A Structured Multimodal Data Augmentation Method for Spatial Understanding

Hang Yin, Xiaomin He, PeiWen Yuan, Yiwei Li, Jiayi Shi, Wenxiao Fan, Shaoxiong Feng, Kan Li

TL;DR

The paper tackles spatial hallucinations in vision-language models caused by weak spatial grounding in training data. It introduces Stitch and Tell (SiTe), a lightweight data augmentation that stitches image pairs and generates spatially explicit captions and QA prompts, providing weak but effective spatial supervision without extra annotations or heavy computation. Across three models, two datasets, and a broad suite of benchmarks, SiTe yields consistent gains in spatial reasoning while preserving or improving general vision-language performance, and ablations identify a robust 1:3 stitching ratio. The approach is simple to integrate and scalable, offering a practical path to strengthen cross-modal spatial grounding in real-world systems.

Abstract

Existing vision-language models often suffer from spatial hallucinations, i.e., generating incorrect descriptions about the relative positions of objects in an image. We argue that this problem mainly stems from the asymmetric properties between images and text. To enrich the spatial understanding ability of vision-language models, we propose a simple, annotation-free, plug-and-play method named Stitch and Tell (abbreviated as SiTe), which injects structured spatial supervision into data. It constructs stitched image-text pairs by stitching images along a spatial axis and generating spatially-aware captions or question-answer pairs based on the layout of the stitched image, without relying on costly advanced models or human involvement. We evaluate SiTe across three architectures (LLaVA-v1.5-7B, LLaVA-Qwen2-1.5B, and HALVA-7B), two training datasets, and eight benchmarks. Experiments show that SiTe improves spatial understanding tasks such as $\text{MME}_{\text{Position}}$ (+5.50%) and Spatial-MM (+4.19%), while maintaining or improving performance on general vision-language benchmarks including COCO-QA (+1.02%) and MMBench (+4.76%). Our findings suggest that explicitly injecting spatially-aware structure into training data offers an effective way to mitigate spatial hallucinations and improve spatial understanding, while preserving general vision-language capabilities.
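The stitching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the nested-list image representation, the function names (`stitch_horizontal`, `stitch_vertical`, `spatial_caption`, `spatial_qa`), the zero padding, and the caption and question templates are all assumptions made for illustration. The sketch only shows how two captioned images might be joined along an axis and paired with text that names each image's position.

```python
from typing import List, Tuple

# Illustrative stand-in for an image: rows of grayscale pixel values.
# A real pipeline would use an image library; this keeps the sketch dependency-free.
Image = List[List[int]]

# Hypothetical position phrases per stitching axis (not taken from the paper).
POSITIONS = {
    "horizontal": ("left", "right"),
    "vertical": ("top", "bottom"),
}


def stitch_horizontal(left: Image, right: Image, pad: int = 0) -> Image:
    """Place two images side by side, padding the shorter one at the bottom."""
    height = max(len(left), len(right))

    def padded(img: Image) -> Image:
        width = len(img[0])
        return img + [[pad] * width for _ in range(height - len(img))]

    return [l_row + r_row for l_row, r_row in zip(padded(left), padded(right))]


def stitch_vertical(top: Image, bottom: Image, pad: int = 0) -> Image:
    """Stack two images, padding the narrower one on the right."""
    width = max(len(top[0]), len(bottom[0]))

    def padded(img: Image) -> Image:
        return [row + [pad] * (width - len(row)) for row in img]

    return padded(top) + padded(bottom)


def spatial_caption(cap_a: str, cap_b: str, axis: str) -> str:
    """Join two captions with explicit location cues (assumed template)."""
    first, second = POSITIONS[axis]
    return f"On the {first}: {cap_a} On the {second}: {cap_b}"


def spatial_qa(cap_a: str, axis: str) -> Tuple[str, str]:
    """Build one question-answer pair about where the first image sits."""
    first, second = POSITIONS[axis]
    question = (
        f"Which part of the image matches this description: '{cap_a}' "
        f"Answer with {first} or {second}."
    )
    return question, first


if __name__ == "__main__":
    img_a = [[1, 1], [1, 1]]   # 2x2 image
    img_b = [[2], [2], [2]]    # 3x1 image
    stitched = stitch_horizontal(img_a, img_b)
    print(stitched)
    print(spatial_caption("A dog on a sofa.", "A cat on a rug.", "horizontal"))
    print(spatial_qa("A dog on a sofa.", "horizontal"))
```

Because the position labels follow directly from the stitching axis and order, the generated text needs no extra annotation or external model, which matches the annotation-free property the abstract claims.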

Paper Structure

This paper contains 26 sections, 6 figures, 14 tables, 1 algorithm.

Figures (6)

  • Figure 1: The difference between spatially-agnostic and spatially-aware text. Spatially-aware text has explicit location cues that clarify object positions.
  • Figure 2: An Overview of Stitch and Tell
  • Figure 3: A construction process of spatial visual question answering data
  • Figure 4: Qualitative comparisons between LLaVA and SiTe. SiTe can better distribute attention between objects and judge their spatial relationships.
  • Figure 5: Qualitative comparisons among the baseline LLaVA and our SiTe models.
  • ...and 1 more figure