SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi

TL;DR

The paper presents semantic orientation as a language-grounded representation for object orientation to enable open-world 6-DoF manipulation. It introduces OrienText300K for large-scale orientation-language data and PointSO for zero-shot semantic orientation prediction, integrated into the SoFar system that builds 6-DoF scene graphs and enables orientation-aware reasoning with VLMs. Through extensive real-world and simulated experiments, SoFar demonstrates strong zero-shot generalization and notable gains in orientation- and 6-DoF-related tasks, while highlighting areas for further improvement such as grasping robustness and end-to-end integration. Overall, the work advances language-grounded spatial reasoning for robotics by grounding orientation in semantic language and providing scalable data, models, and benchmarks.

Abstract

While spatial reasoning has made progress in modeling object localization relationships, it often overlooks object orientation, a key factor in fine-grained 6-DoF manipulation. Traditional pose representations rely on pre-defined frames or templates, limiting generalization and semantic grounding. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a cup). To support this, we construct OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. By integrating semantic orientation into VLM agents, our SoFar framework enables 6-DoF spatial reasoning and generates robotic actions. Extensive experiments demonstrate the effectiveness and generalization of SoFar, e.g., a zero-shot success rate of 48.7% on Open6DOR and a zero-shot success rate of 74.9% on SIMPLER-Env.
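To make the idea of a semantic orientation concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each language phrase (such as "handle") maps to a 3D direction vector, and it measures the angle between that direction and a desired target direction. The dictionary contents, the target vector, and the helper names `unit`, `dot`, and `angle_deg` are all hypothetical.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("zero-length direction")
    return tuple(x / n for x in v)

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def angle_deg(a, b):
    """Angle in degrees between directions a and b."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    c = max(-1.0, min(1.0, dot(unit(a), unit(b))))
    return math.degrees(math.acos(c))

# Hypothetical semantic orientations for a cup: phrase -> direction.
# The values are made up for illustration, not predictions from PointSO.
semantic_orientations = {
    "handle": (1.0, 0.0, 0.0),
    "opening": (0.0, 0.0, 1.0),
}

# Hypothetical goal: the handle should point along +y.
target_direction = (0.0, 1.0, 0.0)
error_deg = angle_deg(semantic_orientations["handle"], target_direction)
print(f"handle is {error_deg:.1f} degrees away from the target direction")
```

Under these assumptions, an instruction such as "turn the handle to the left" reduces to rotating the object until this angular error reaches zero.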

Paper Structure

This paper contains 54 sections, 2 equations, 25 figures, 14 tables.

Figures (25)

  • Figure 1: Representation comparison between semantic orientation and others.
  • Figure 2: Visualization of OrienText300K data construction and validation results.
  • Figure 3: Overview of the SoFar system. Given RGB-D images and language instructions, SoFar first leverages a VLM to identify relevant object phrases and semantic orientations. It then uses the foundation models Florence-2 and SAM, together with our PointSO, for object segmentation and semantic orientation estimation. This information forms a 6-DoF scene graph, which the VLM uses alongside the RGB image to perform spatial understanding tasks or generate manipulation actions.
  • Figure 4: Qualitative results of real world language-grounded manipulation. SoFar can generalize across various embodiments, tasks and environments.
  • Figure 5: Quantitative evaluation of zero-shot real-world language-grounded rearrangement. We design 60 diverse real-world tasks involving over 100 diverse objects (detailed in \ref{tab:detailed_realworld}).
  • ...and 20 more figures
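
The 6-DoF scene graph described in the Figure 3 caption can be pictured as a list of object nodes, each carrying a position and a set of language-labelled directions, serialized to text for the VLM. The sketch below is only an illustration under assumptions: the field names, the coordinate convention, and the JSON serialization are hypothetical and may differ from the format SoFar actually uses.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ObjectNode:
    """One object in a hypothetical 6-DoF scene graph."""
    name: str
    # Assumed: object centre (x, y, z) in some shared frame, in metres.
    position: tuple
    # Assumed: semantic orientation phrase -> unit direction vector.
    orientations: dict = field(default_factory=dict)

def scene_graph_to_text(nodes):
    """Serialize the nodes to JSON text that could be placed in a VLM prompt."""
    return json.dumps([asdict(node) for node in nodes], indent=2)

# Made-up example values, not outputs of Florence-2, SAM, or PointSO.
nodes = [
    ObjectNode("cup", (0.40, 0.10, 0.02), {"handle": (1.0, 0.0, 0.0)}),
    ObjectNode("usb drive", (0.25, -0.05, 0.01), {"plug-in": (0.0, 1.0, 0.0)}),
]
graph_text = scene_graph_to_text(nodes)
print(graph_text)
```

Keeping orientations as a phrase-to-direction dictionary lets one object carry several semantic orientations at once, which matches the abstract's examples of a "plug-in" direction and a "handle" direction.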