SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang

TL;DR

SIFThinker tackles spatially grounded visual reasoning in multimodal models with a think-with-images framework that adaptively focuses on depth-informed regions. It combines a novel data-generation pipeline (which yields the SIF-50K dataset) with a two-stage training regime, a warm-up stage followed by GRPO-SIF reinforcement learning, to optimize region grounding and depth-consistent reasoning. The approach uses a Hierarchical IoU reward and multiple task-specific rewards to foster coherent, interpretable interleaved image-text reasoning and robust 3D understanding. Empirical results demonstrate superior spatial intelligence and fine-grained visual perception across diverse benchmarks, while maintaining generalization without external tools. Code is released for reproducibility.
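For readers unfamiliar with the quantity that an IoU-based reward builds on, the following is a minimal sketch of plain intersection-over-union between two axis-aligned boxes. It is not the paper's Hierarchical IoU, whose definition is not given in this summary; the `(x1, y1, x2, y2)` box layout and the name `box_iou` are assumptions made for illustration.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) with x1 <= x2, y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Standard IoU of two axis-aligned boxes (illustrative, not HIoU)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0
```

A predicted box that exactly matches the target scores 1.0, and a box with no overlap scores 0.0, which is what makes IoU a natural building block for a grounding reward.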

Abstract

Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning; however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. Specifically, SIFThinker enables attention correction and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold. First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Second, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception while maintaining strong general capabilities, highlighting the effectiveness of our method. Code: https://github.com/zhangquanchen/SIFThinker.
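As background for the GRPO-SIF name, the sketch below shows the group-relative advantage normalization commonly used in GRPO-style training: several responses are sampled for one prompt, and each reward is standardized against the group. This is a generic illustration under stated assumptions, not the authors' implementation; the function name, the use of the population standard deviation, and the `eps` stabilizer are all choices made here, and the specific rewards used by GRPO-SIF are not reproduced.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampled group (illustrative sketch)."""
    mu = mean(rewards)
    # Population std is an assumption; some implementations use sample std.
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.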

Paper Structure

This paper contains 43 sections, 9 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: The schematic illustration of SIFThinker. (a) We propose a spatially-aware image focus paradigm, in which four novel reward functions are introduced under the RL framework to guide the optimization process. (b) The training pipeline of SIFThinker is illustrated, which builds upon our proposed data generation pipeline. It begins with a warm-up stage, followed by GRPO-SIF as described in (a). (c) illustrates the inference pipeline of our method given a question-image input.
  • Figure 2: Visualization of our proposed $HIoU$ (left). The performance of $GIoU$ and of $PIoU$ is illustrated separately (right), highlighting the robustness against reward hacking.
  • Figure 3: Comparison with other open-source SOTA methods (blue) on various benchmarks, using the same base model. We also include the performance of the proprietary SOTA model ChatGPT-o3-2025-04-16 (green).
  • Figure 4: Visualization of SIFThinker's region correction and multi-object detection capabilities.
  • Figure 5: Prompt specifically crafted to guide the model in generating interleaved image-text reasoning chains, which is consistently appended during inference.
  • ...and 8 more figures