SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, Helong Huang, Guangjian Tian, Weichao Qiu, Xingyue Quan, Jianye Hao, Yuzheng Zhuang

TL;DR

SpatialCoT introduces a two-stage fine-tuning framework to improve spatial reasoning in vision-language models (VLMs) for embodied task planning. The first stage aligns vision-language inputs with spatial coordinates; the second grounds chain-of-thought reasoning in coordinate-based actions, so that language-based reasoning yields fine-grained, collision-aware actions. The approach is validated on closed-loop navigation and tabletop manipulation tasks, where it improves over baselines and shows a positive relationship between fundamental spatial capabilities and downstream task performance. A data-generation pipeline for high-quality rationales further reduces annotation costs, supporting more efficient training and real-world transfer.

Abstract

Spatial reasoning is an essential problem in embodied AI research. Efforts to enhance spatial reasoning abilities through supplementary spatial data and fine-tuning have proven limited and ineffective when addressing complex embodied tasks, largely due to their dependence on language-based outputs. While some approaches have introduced a point-based action space to mitigate this issue, they fall short in managing more intricate tasks within complex environments. This deficiency arises from their failure to fully exploit the inherent thinking and reasoning capabilities that are fundamental strengths of Vision-Language Models (VLMs). To address these limitations, we propose a novel approach named SpatialCoT, specifically designed to bolster the spatial reasoning capabilities of VLMs. Our approach comprises two stages: spatial coordinate bi-directional alignment, which aligns vision-language inputs with spatial coordinates, and chain-of-thought spatial grounding, which harnesses the reasoning capabilities of language models for advanced spatial reasoning. We evaluate SpatialCoT on challenging navigation and manipulation tasks, both in simulation and real-world settings. Experimental results demonstrate that our method significantly outperforms previous state-of-the-art approaches in both tasks.
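
The second stage described above (a language rationale first, then a coordinate-based action) can be illustrated with a minimal sketch. The response format here is an assumption made for illustration only: the `Action:` marker, the normalized `(x, y)` point syntax, and the names `GroundedAction`, `parse_response`, and `to_pixels` are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical format: free-text rationale, then "Action:" followed by one or
# more (x, y) points in normalized image coordinates (0.0 to 1.0).
POINT = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


@dataclass
class GroundedAction:
    rationale: str                     # the language-based reasoning
    points: list[tuple[float, float]]  # the coordinate-based action


def parse_response(text: str, marker: str = "Action:") -> GroundedAction:
    """Split a model response into its rationale and its coordinate action."""
    rationale, sep, action = text.partition(marker)
    if not sep:
        raise ValueError("response contains no action marker")
    points = [(float(x), float(y)) for x, y in POINT.findall(action)]
    if not points:
        raise ValueError("no coordinates found after the action marker")
    for x, y in points:
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            raise ValueError(f"point ({x}, {y}) is outside the normalized range")
    return GroundedAction(rationale.strip(), points)


def to_pixels(point: tuple[float, float], width: int, height: int) -> tuple[int, int]:
    """Convert a normalized point to pixel coordinates for a given image size."""
    x, y = point
    return round(x * width), round(y * height)


if __name__ == "__main__":
    response = (
        "The mug sits left of the plate, so the free area to the right "
        "avoids a collision. Action: (0.62, 0.40)"
    )
    action = parse_response(response)
    print(action.rationale)
    print(to_pixels(action.points[0], 640, 480))
```

Only the text after the marker is scanned for points, so coordinates mentioned inside the rationale are not mistaken for the action.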

Paper Structure

This paper contains 27 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison between SpatialCoT and previous methods. a) Previous methods usually directly output the action based on the language instruction. b) SpatialCoT enhances action generation quality by effectively leveraging the reasoning capabilities of VLMs. This is achieved through a two-stage finetuning process involving spatial coordinate alignment and chain-of-thought spatial grounding.
  • Figure 2: Overview of SpatialCoT, comprising two core stages. a) Spatial coordinate bi-directional alignment, which involves translating coordinates to language (indicated by the blue to yellow arrow on the left) and language to coordinates (indicated by the yellow to blue arrow on the right). b) Chain-of-thought spatial grounding: the model first performs comprehensive thinking by generating a language-based rationale, and then grounds it in coordinate-based actions (yellow to blue dashed line), significantly improving the model's performance in complex spatial reasoning tasks.
  • Figure 3: Data collection pipeline for chain-of-thought spatial grounding.
  • Figure 4: Visualization of spatial reasoning results on navigation and manipulation tasks.
  • Figure 5: Real-world rearrangement experiments: SpatialCoT arranges various object combinations into reasonable layouts, adhering to physical constraints and avoiding collisions.
  • ...and 2 more figures
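
The bi-directional alignment described in the Figure 2 caption (coordinates to language, and language to coordinates) can be pictured as paired training samples built from the same annotation. The sketch below is illustrative only: the prompt wording, the two-decimal normalized coordinate format, and the names `Sample`, `coords_to_language`, `language_to_coords`, and `build_pairs` are assumptions, not the paper's actual data format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str  # text shown to the model alongside the image
    target: str  # text the model is trained to produce


def fmt(point: tuple[float, float]) -> str:
    """Render a normalized (x, y) point with two decimals (assumed format)."""
    x, y = point
    return f"({x:.2f}, {y:.2f})"


def coords_to_language(label: str, point: tuple[float, float]) -> Sample:
    # Direction 1: given coordinates, describe what is there.
    return Sample(prompt=f"What is located at {fmt(point)}?", target=label)


def language_to_coords(label: str, point: tuple[float, float]) -> Sample:
    # Direction 2: given a description, produce the coordinates.
    return Sample(prompt=f"Where is the {label}?", target=fmt(point))


def build_pairs(annotations: dict[str, tuple[float, float]]) -> list[Sample]:
    """Emit both directions for every annotated object."""
    samples: list[Sample] = []
    for label, point in annotations.items():
        samples.append(coords_to_language(label, point))
        samples.append(language_to_coords(label, point))
    return samples


if __name__ == "__main__":
    for sample in build_pairs({"red mug": (0.3, 0.71)}):
        print(sample.prompt, "->", sample.target)
```

Generating both directions from one annotation keeps the two mappings consistent with each other, which is the property the word "bi-directional" suggests.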