SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation

Hao Shi, Bin Xie, Yingfei Liu, Yang Yue, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, Gao Huang

TL;DR

SpatialActor tackles the challenge of robust spatial understanding in robotic manipulation by decoupling semantics from geometry and by separating high-level geometry from low-level spatial cues. The Semantic-guided Geometric Module fuses coarse geometry from a depth-estimation expert with fine-grained, noisy depth, while the Spatial Transformer encodes spatial cues with rotary position embeddings and performs view- and scene-level interactions to guide the action head. Across 50+ tasks in RLBench, ColosseumBench, and real-world setups, SpatialActor achieves state-of-the-art performance, demonstrates strong robustness to depth noise and spatial perturbations, and shows impressive few-shot generalization. These results underscore the practical significance of disentangled spatial representations for robust and generalizable robotic manipulation in real-world environments.

Abstract

Robotic manipulation requires precise spatial understanding to interact with objects in the real world. Point-based methods suffer from sparse sampling, leading to the loss of fine-grained semantics. Image-based methods typically feed RGB and depth into 2D backbones pre-trained on 3D auxiliary tasks, but their entangled semantics and geometry are sensitive to the depth noise inherent in real-world sensing, which disrupts semantic understanding. Moreover, these methods focus on high-level geometry while overlooking low-level spatial cues essential for precise interaction. We propose SpatialActor, a disentangled framework for robust robotic manipulation that explicitly decouples semantics and geometry. The Semantic-guided Geometric Module adaptively fuses two complementary geometric representations, one from noisy depth and one from semantic-guided expert priors. In addition, a Spatial Transformer leverages low-level spatial cues for accurate 2D-3D mapping and enables interaction among spatial features. We evaluate SpatialActor on multiple simulation and real-world scenarios across 50+ tasks. It achieves state-of-the-art performance with 87.4% on RLBench and improves by 13.9% to 19.4% under varying noise conditions, showing strong robustness. Moreover, it significantly enhances few-shot generalization to new tasks and maintains robustness under various spatial perturbations. Project Page: https://shihao1895.github.io/SpatialActor
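
To make the adaptive fusion in the Semantic-guided Geometric Module more concrete, the following is a minimal NumPy sketch of a gated blend between two feature maps. It is an illustration under stated assumptions, not the authors' implementation: the function name `gated_fusion`, the single linear layer that produces the gate, the feature shapes, and the choice of which branch the gate multiplies are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(expert_feat, depth_feat, weight, bias):
    """Blend two (N, C) feature maps with a per-element gate in (0, 1).

    expert_feat: coarse but robust geometry features (assumed shape (N, C)).
    depth_feat:  fine-grained but noisy depth features (assumed shape (N, C)).
    weight, bias: parameters of a hypothetical linear gate, shapes (2C, C) and (C,).
    """
    # The gate sees both inputs: (N, 2C) @ (2C, C) -> (N, C).
    both = np.concatenate([expert_feat, depth_feat], axis=-1)
    gate = sigmoid(both @ weight + bias)
    # Convex combination: gate near 1 trusts the raw depth features,
    # gate near 0 falls back to the expert prior (an assumed convention).
    fused = gate * depth_feat + (1.0 - gate) * expert_feat
    return fused, gate


rng = np.random.default_rng(0)
n_tokens, channels = 4, 8
expert_feat = rng.normal(size=(n_tokens, channels))
depth_feat = rng.normal(size=(n_tokens, channels))
weight = rng.normal(size=(2 * channels, channels)) * 0.1
bias = np.zeros(channels)

fused, gate = gated_fusion(expert_feat, depth_feat, weight, bias)
print(fused.shape, float(gate.min()), float(gate.max()))
```

Because the output is a convex combination, every fused value lies between the corresponding expert and depth values, so a gate that learns to distrust noisy depth can fall back smoothly to the expert prior.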

Paper Structure

This paper contains 35 sections, 11 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Methodology comparisons. (a) Point-based methods suffer from sparse sampling, leading to the loss of fine-grained semantics. (b) Image-based methods typically entangle semantics and geometry, while the depth noise inherent in real-world sensing disrupts semantic understanding. (c) SpatialActor disentangles visual semantics, two complementary high-level geometric representations (from noisy depth and expert priors), and low-level spatial cues. (d) Performance under various degrees of noise, showing robustness.
  • Figure 2: Overall framework of SpatialActor. The architecture employs separate vision and depth encoders. Semantic-guided Geometric Module (SGM) adaptively fuses robust yet coarse geometric priors from a pretrained depth expert with noisy depth features via gated fusion to yield high-level geometric representations. In the Spatial Transformer (SPT), low-level spatial cues are encoded as positional embeddings to drive spatial interactions. Finally, view-level interactions refine intra-view features, while scene-level interactions consolidate cross-modal information across views to support the subsequent action head.
  • Figure 3: Semantic-guided Geometric Module and Spatial Transformer. (a) SGM adaptively combines two complementary geometric representations via a gating mechanism. (b) SPT converts 3D points into spatial positional embeddings using RoPE to establish 2D–3D correspondences, followed by view-level and scene-level interactions for spatial token refinement.
  • Figure 4: Real-world tasks. We employed 8 distinct tasks with a total of 15 variants in real-world experiments.
  • Figure 5: Real-world Generalization Evaluation. We assess SpatialActor under variations in manipulated object, receiver object, brightness, and background. Performance remains robust across challenging settings.
  • ...and 7 more figures
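
The Figure 3(b) caption describes converting 3D points into spatial positional embeddings using RoPE. The sketch below shows one way such an embedding could work: feature channels are split into three groups, and each group is rotated by angles proportional to the x, y, or z coordinate. The function name `rope_3d`, the equal channel split across axes, and the frequency base of 10000 are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def rope_3d(feat, points, base=10000.0):
    """Rotate (N, C) features by angles derived from (N, 3) point coordinates.

    C must be divisible by 6: three axis groups, each made of 2D rotation pairs.
    """
    n, c = feat.shape
    assert c % 6 == 0, "channel count must be divisible by 6"
    per_axis = c // 3
    half = per_axis // 2
    # One rotation frequency per channel pair (hypothetical schedule).
    freqs = base ** (-np.arange(half) / half)
    out = np.empty_like(feat, dtype=float)
    for axis in range(3):
        block = feat[:, axis * per_axis:(axis + 1) * per_axis]
        angles = points[:, axis:axis + 1] * freqs  # shape (N, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = block[:, :half], block[:, half:]
        # Standard 2D rotation applied to each (x1, x2) channel pair.
        rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
        out[:, axis * per_axis:(axis + 1) * per_axis] = rotated
    return out


rng = np.random.default_rng(0)
query = rng.normal(size=(1, 12))
key = rng.normal(size=(1, 12))
p_query = np.array([[0.10, 0.20, 0.30]])
p_key = np.array([[0.40, -0.10, 0.25]])
shift = np.array([[1.0, -2.0, 0.5]])

score = (rope_3d(query, p_query) @ rope_3d(key, p_key).T).item()
score_shifted = (rope_3d(query, p_query + shift) @ rope_3d(key, p_key + shift).T).item()
print(np.isclose(score, score_shifted))
```

The final comparison illustrates the property that makes rotary embeddings attractive for spatial reasoning: the attention score between two tokens depends only on the relative offset between their 3D points, so translating both points by the same amount leaves the score unchanged.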