HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation

Chongyang Xu, Shen Cheng, Haipeng Li, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

TL;DR

HeRO is a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields, avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation.

Abstract

Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from its heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on the global and local fields using a permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In experiments, HeRO establishes a new state of the art, improving the success rate on Place Dual Shoes by 12.3% and averaging a 6.5% gain across six challenging pose-aware tasks. Code is available at https://github.com/Chongyang-99/HeRO.
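The abstract does not state how the DINOv2 and Stable Diffusion features are fused. As a minimal sketch, assuming a simple normalize-then-concatenate rule (the function names and the `alpha` weight are hypothetical, not taken from the paper or its code), the fusion of one pixel's two feature vectors could look like this:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(dino_feat, sd_feat, alpha=0.5):
    """Fuse one pixel's features from two backbones (assumed fusion rule).

    Each vector is L2-normalized so neither backbone dominates by scale,
    then the two are concatenated with weights sqrt(alpha) and sqrt(1 - alpha).
    With two non-zero inputs the fused vector again has unit norm.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    d = l2_normalize(dino_feat)
    s = l2_normalize(sd_feat)
    w_d, w_s = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    return [w_d * x for x in d] + [w_s * x for x in s]


# Toy example: a 2-dim "DINOv2" feature and a 3-dim "Stable Diffusion" feature.
fused = fuse_features([3.0, 4.0], [0.0, 2.0, 0.0])
print(len(fused))  # dimension of the fused feature: 2 + 3
```

Normalizing before concatenation keeps the two backbones on a comparable scale, so `alpha` alone controls their relative influence in any later similarity computation.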

Paper Structure (17 sections, 9 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Pose-Aware Manipulation with Semantic Understanding. Top: Many manipulation tasks are pose-aware (e.g., placing shoes with toes aligned left), demanding semantic part perception from policies. Bottom Left: Our dense semantic fields are smoother and more consistent than those of the baseline G3Flow [chen2025g3flow]. Bottom Right: Our method achieves a 12.3% higher success rate on the Place Dual Shoes task and a 6.5% higher average success rate across six challenging tasks.
  • Figure 2: Comparison of Conditioning Mechanisms. Top: The baseline employs a holistic conditioning approach, encoding the entire object point cloud into a single global vector, which lacks part-level details. Bottom: Our method uses a hierarchical approach. A global encoder captures overall context, while additional encoders extract complementary local features for fine-grained details. The resulting Hierarchical Condition provides both global and local information, enabling more precise manipulation.
  • Figure 3: Method Overview. Our framework generates precise, pose-aware actions in a two-stage process. (a) Dense Semantic Lifting: We track object 6D poses from sequential frames and lift fused 2D features from DINOv2 (semantic) and Stable Diffusion (geometric) into an object-centric Dense Semantic Point Cloud. (b) Hierarchical Conditioning Module: This point cloud is abstracted into a hierarchical Object Part Point Cloud, which conditions the diffusion policy via two pathways: as a Global Condition for the action denoiser and as fine-grained local guidance injected by a Permutation-invariant Refine module at each denoising step. This dual mechanism enables precise, pose-aware action generation.
  • Figure 4: Hierarchical Conditioning Module Architecture. Our model uses a dual-pathway design to guide the Action Denoiser. The Global Path processes the entire point cloud into a single global condition for high-level context. The Local Path partitions the point cloud into semantic parts, which are encoded by Permutation-invariant Refiners into a set of fine-grained embeddings. These local conditions are injected into the denoiser, enabling actions that are both globally consistent and locally precise.
  • Figure 5: Simulated Manipulation Tasks. We evaluate our method on six challenging tasks from the RoboTwin 2.0 benchmark. These tasks necessitate precise pose estimation and a nuanced understanding of object-part semantics to facilitate successful interaction.
  • ...and 3 more figures
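The captions for Figures 2 and 4 describe a global path over the whole point cloud and a local path over semantic parts, combined in a permutation-invariant way. The sketch below illustrates only the permutation-invariance idea, using PointNet-style symmetric pooling as a stand-in. It is not the paper's Permutation-invariant Refine module, and all names, shapes, and weights here are hypothetical.

```python
def point_layer(point, weights):
    """Shared per-point linear layer followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, point))) for row in weights]


def max_pool(vectors):
    """Element-wise max over a set of vectors (order-independent)."""
    return [max(col) for col in zip(*vectors)]


def mean_pool(vectors):
    """Element-wise mean over a set of vectors (order-independent)."""
    return [sum(col) / len(col) for col in zip(*vectors)]


def encode_set(points, weights):
    """Encode an unordered point set: shared layer per point, then max pool."""
    return max_pool([point_layer(p, weights) for p in points])


def hierarchical_condition(parts, w_global, w_local):
    """Build a condition vector from a global and a local pathway.

    parts: list of parts, each a list of points (lists of floats).
    Global path: encode all points of the object as one set.
    Local path: encode each part separately, then mean-pool over parts
    (an assumed simplification; the figure describes a set of embeddings).
    """
    all_points = [p for part in parts for p in part]
    global_vec = encode_set(all_points, w_global)
    local_vec = mean_pool([encode_set(part, w_local) for part in parts])
    return global_vec + local_vec


# Toy object with three parts of 3D points and fixed illustrative weights.
w_g = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
w_l = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
parts = [
    [[0.0, 1.0, 2.0], [1.0, 0.0, 0.0]],
    [[2.0, 2.0, 0.0]],
    [[-1.0, 3.0, 1.0], [0.5, 0.5, 0.5]],
]
cond = hierarchical_condition(parts, w_g, w_l)
reordered = hierarchical_condition(list(reversed(parts)), w_g, w_l)
# Reordering the parts changes the result by at most floating-point rounding.
print(max(abs(a - b) for a, b in zip(cond, reordered)) < 1e-9)
```

Because both pooling steps are symmetric functions, neither the order of points within a part nor the order of the parts affects the condition vector, which is the property the abstract credits with avoiding order-sensitive bias.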