From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models

Masanari Oi, Koki Maeda, Ryuto Koike, Daisuke Oba, Nakamasa Inoue, Naoaki Okazaki

TL;DR

This work addresses multi-image spatial reasoning by introducing HATCH, a training framework that explicitly supervises cross-view correspondence and stepwise viewpoint transformation. Its first component, Patch-Level Spatial Alignment (PaStA), provides geometry-supervised patch-level alignment across views. Its second, Action-then-Answer Reasoning (ActoR), requires the model to emit an explicit sequence of viewpoint-transition actions before answering and is trained with reinforcement learning using verifiable rewards. Across three benchmarks, HATCH yields substantial gains over baselines of similar size and remains competitive with larger models, while preserving single-image reasoning performance. By making interpretable geometric reasoning steps part of the training objective, the approach aims at more robust cross-view integration in vision–language models.

Abstract

While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.
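
To make "stepwise viewpoint transformation" concrete, the following minimal sketch composes a sequence of relative camera rotations into one net viewpoint change. It is an illustration only, not the paper's implementation: it assumes every step is a yaw rotation about the vertical axis, and the helper names (`rot_z`, `compose`, `yaw_of`) are hypothetical.

```python
import math

def rot_z(deg):
    """3x3 rotation matrix for a yaw of `deg` degrees about the vertical axis."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def compose(steps_deg):
    """Apply relative yaw changes one after another (assumed yaw-only steps)."""
    total = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for deg in steps_deg:
        # Each new step is applied on top of the accumulated transformation.
        total = matmul(rot_z(deg), total)
    return total

def yaw_of(r):
    """Recover the net yaw angle in degrees from a yaw-only rotation matrix."""
    return math.degrees(math.atan2(r[1][0], r[0][0]))

# Three relative viewpoint changes, e.g. view 1 -> 2 -> 3 -> 4.
net = compose([30, 45, -15])
net_yaw = yaw_of(net)
```

Because all steps here share one axis, the net yaw is simply the sum of the individual steps. With general 3D rotations the order of composition matters, which is one reason to track each step explicitly.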

Paper Structure (41 sections, 11 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Two cognitive mechanisms underlying multi-image spatial reasoning: (a) cross-view correspondence, identifying regions across views that correspond to the same physical locations; (b) stepwise viewpoint transformation, composing relative viewpoint changes (e.g., rotations) in a sequential manner.
  • Figure 2: HATCH pipeline overview. HATCH consists of two components: (i) Patch-Level Spatial Alignment (PaStA), which learns cross-view correspondence, and (ii) Action-then-Answer Reasoning (ActoR), which performs stepwise viewpoint transformation via explicit actions.
  • Figure 3: Training dynamics during GRPO training of ActoR. Action and QA accuracy rewards are shown for HATCH (yellow) and an ablated variant without PaStA (blue).
  • Figure 4: Grid resolution analysis for PaStA. SPAR-Bench-MV average accuracy improves up to $n=4$ and drops for $n \geq 5$, indicating that overly fine grids hurt correspondence learning.
  • Figure 5: Qualitative success and failure cases for different reasoning modalities. Compared with natural language reasoning, action-based reasoning (HATCH) yields explicit, quantitative camera operations that more directly support correct multi-image spatial inference.
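
The action and QA accuracy rewards mentioned for Figure 3 can be illustrated with a small sketch of a verifiable reward for action-then-answer outputs. This is an assumption-laden example rather than the paper's reward: the `<action>`/`<answer>` tag format, the action strings, and the exact-match scoring are all hypothetical choices made for illustration.

```python
import re

# Hypothetical output format: one or more <action> tags, then one <answer> tag.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.S)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.S)

def parse(output):
    """Return (actions, answer), or None if the action-then-answer format is violated."""
    answers = list(ANSWER_RE.finditer(output))
    if len(answers) != 1:
        return None
    answer = answers[0]
    actions = list(ACTION_RE.finditer(output))
    # Require at least one action, and every action must precede the answer.
    if not actions or any(m.start() > answer.start() for m in actions):
        return None
    return [m.group(1).strip() for m in actions], answer.group(1).strip()

def rewards(output, gold_actions, gold_answer):
    """Compute separate action and QA rewards (assumed exact-match scoring)."""
    parsed = parse(output)
    if parsed is None:
        return {"action": 0.0, "qa": 0.0}
    actions, answer = parsed
    # Position-wise exact match, normalized by the longer sequence so that
    # both missing and extra actions are penalized.
    matched = sum(a == g for a, g in zip(actions, gold_actions))
    action_reward = matched / max(len(actions), len(gold_actions))
    qa_reward = 1.0 if answer == gold_answer else 0.0
    return {"action": action_reward, "qa": qa_reward}

sample = "<action>rotate_left 90</action><action>move_forward 1</action><answer>B</answer>"
scores = rewards(sample, ["rotate_left 90", "move_forward 1"], "B")
```

Keeping the two rewards separate, as in the curves of Figure 3, makes it possible to see whether the action sequence or the final answer is improving during training. Both scores are checked against ground truth by a rule, which is what makes them verifiable.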