Learning Spatial-Aware Manipulation Ordering

Yuxiang Yan; Zhiyuan Zhou; Xin Gao; Guanghao Li; Shenglin Li; Jiaqi Chen; Qunyan Pu; Jian Pu

TL;DR

OrderMind presents a unified framework for spatial-aware manipulation ordering in cluttered environments, combining a spatial context encoder with a temporal priority structuring module to directly predict manipulation priorities. It constructs a kNN-based spatial graph to capture object-object and object-manipulator interactions and introduces spatial priors, generated via a Vision-Language Model, to supervise learning without manual labeling. The approach achieves state-of-the-art performance on a large Manipulation Ordering Benchmark, delivering real-time inference and robust transfer to real-world scenes, with ablations confirming the complementary benefits of perception, context integration, and order reasoning. The work advances autonomous robotic manipulation by enabling safe, efficient, and scalable ordering in visually complex cluttered environments.

Abstract

Manipulation in cluttered environments is challenging due to spatial dependencies among objects, where an improper manipulation order can cause collisions or blocked access. Existing approaches often overlook these spatial relationships, limiting their flexibility and scalability. To address these limitations, we propose OrderMind, a unified spatial-aware manipulation ordering framework that directly learns object manipulation priorities based on spatial context. Our architecture integrates a spatial context encoder with a temporal priority structuring module. We construct a spatial graph using k-Nearest Neighbors to aggregate geometric information from the local layout and encode both object-object and object-manipulator interactions to support accurate manipulation ordering in real time. To generate physically and semantically plausible supervision signals, we introduce a spatial prior labeling method that guides a vision-language model to produce reasonable manipulation orders for distillation. We evaluate OrderMind on our Manipulation Ordering Benchmark, comprising 163,222 samples of varying difficulty. Extensive experiments in both simulation and real-world environments demonstrate that our method significantly outperforms prior approaches in effectiveness and efficiency, enabling robust manipulation in cluttered scenes.
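
The k-Nearest Neighbors graph construction mentioned above can be illustrated with a minimal sketch. This is not the OrderMind implementation: the function name `knn_graph`, the use of 3D object centers as node positions, Euclidean distance as the metric, and the sample coordinates are all assumptions made for illustration only.

```python
import math

def knn_graph(centers, k):
    """Return directed edges (i, j) linking each object i to its k nearest objects j.

    `centers` is a list of coordinate tuples (hypothetical object centers).
    """
    edges = []
    for i, ci in enumerate(centers):
        # Euclidean distance from object i to every other object.
        dists = [(math.dist(ci, cj), j) for j, cj in enumerate(centers) if j != i]
        dists.sort()
        # Keep only the k closest objects as neighbours of i.
        edges.extend((i, j) for _, j in dists[:k])
    return edges

# Illustrative object centers in metres (made-up values, not from the benchmark).
centers = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.3, 0.0), (1.0, 1.0, 0.0)]
edges = knn_graph(centers, k=2)
```

In a sketch like this, each edge would mark a pair of objects whose local geometry is aggregated together; how the paper builds node features or adds the manipulator to the graph is not specified in this excerpt and is left out here.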
