VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning

Tianxing Zhou, Feiyang Xue, Zhangchen Ye, Tianyuan Yuan, Hang Zhao, Tao Jiang

Abstract

Imitation learning is a prominent paradigm for robotic manipulation. However, existing visual imitation methods map 2D image observations directly to 3D action outputs, imposing a 2D-3D mismatch that hinders spatial reasoning and degrades robustness. We present VolumeDP, a policy architecture that restores spatial alignment by explicitly reasoning in 3D. VolumeDP first lifts image features into a Volumetric Representation via cross-attention. It then selects task-relevant voxels with a learnable module and converts them into a compact set of spatial tokens, markedly reducing computation while preserving action-critical geometry. Finally, a multi-token decoder conditions on the entire token set to predict actions, avoiding the lossy aggregation that collapses multiple spatial tokens into a single descriptor. VolumeDP achieves a state-of-the-art average success rate of 88.8% on the LIBERO simulation benchmark, outperforming the strongest baseline by 14.8%. It also delivers large performance gains over prior methods on the ManiSkill and LIBERO-Plus benchmarks. Real-world experiments further demonstrate higher success rates and robust generalization to novel spatial layouts, camera viewpoints, and environment backgrounds. Code will be released.
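To make the token-selection step in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the linear scoring vector `score_weights` stands in for the learnable selection module, and the names `select_spatial_tokens` and `mean_pool` are hypothetical. It contrasts keeping the top-k voxels as separate tokens with collapsing them into a single averaged descriptor, which is the lossy aggregation the abstract says the multi-token decoder avoids.

```python
def select_spatial_tokens(voxel_features, score_weights, k):
    """Keep the k highest-scoring voxels as separate spatial tokens.

    Assumption: a linear score (dot product with score_weights) stands in
    for the learnable selection module; the paper's actual module may differ.
    """
    scores = [
        sum(w * f for w, f in zip(score_weights, feat))
        for feat in voxel_features
    ]
    # Indices of the k largest scores, then restored to original voxel order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    top.sort()
    return top, [voxel_features[i] for i in top]


def mean_pool(tokens):
    """Collapse a token set into one descriptor (the lossy alternative)."""
    n = len(tokens)
    return [sum(tok[d] for tok in tokens) / n for d in range(len(tokens[0]))]


# Four toy voxels with 2-dimensional features (hypothetical values).
voxel_features = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2], [1.5, 1.0]]
score_weights = [1.0, 1.0]

indices, tokens = select_spatial_tokens(voxel_features, score_weights, k=2)
print(indices)            # selected voxel indices
print(tokens)             # multi-token conditioning set
print(mean_pool(tokens))  # single pooled descriptor, for contrast
```

A decoder that conditions on `tokens` sees each selected voxel's features individually, whereas one that conditions on `mean_pool(tokens)` can no longer tell which voxel contributed which feature.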

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: VolumeDP consistently outperforms Diffusion Policy [DiffusionPolicy] and DiT-Block Policy [DiTBlockPolicy] by significant margins across the LIBERO, ManiSkill, and LIBERO-Plus simulation benchmarks and on real-world manipulation tasks.
  • Figure 2: Architecture Overview. Our pipeline consists of three core components: (1) Volumetric Representation: Volume-Image Cross-Attention constructs the Volumetric Representation from image features. (2) Spatial Token Generation: Important features are extracted to form spatial tokens. (3) Multi-Token Decoder: The spatial tokens serve as conditions for the multi-token denoising decoder.
  • Figure 3: Visualization of Tasks from ManiSkill. The arrows indicate the sequential steps required to complete each task.
  • Figure 4: Weights Visualization. Active 3D voxels are projected onto the image plane; brighter red indicates higher activation. The learned weights concentrate on the end effector and the target object, indicating task-relevant spatial awareness.
  • Figure 5: Real World Setup. Left: the real-world robot. Right: the real-world tasks.
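
The projection described in the Figure 4 caption (active 3D voxels projected onto the image plane) can be sketched with a standard pinhole camera model. This is a hedged illustration only: the captions do not specify the camera model or coordinate conventions, so the function name `project_voxels`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the assumption that voxel centers are already expressed in the camera frame are all hypothetical.

```python
def project_voxels(centers, fx, fy, cx, cy, width, height):
    """Project 3D voxel centers (camera frame) to pixel coordinates.

    Assumes an ideal pinhole camera with no lens distortion. Points behind
    the camera (z <= 0) or outside the image bounds are dropped.
    Returns a list of (voxel_index, u, v) tuples.
    """
    pixels = []
    for idx, (x, y, z) in enumerate(centers):
        if z <= 0:
            continue  # behind the camera, not visible
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((idx, u, v))
    return pixels


# Toy voxel centers in metres (hypothetical values).
centers = [
    (0.0, 0.0, 1.0),    # on the optical axis
    (0.5, -0.25, 2.0),  # off-axis, in view
    (0.0, 0.0, -1.0),   # behind the camera
    (5.0, 0.0, 1.0),    # far outside the image
]
visible = project_voxels(centers, fx=100.0, fy=100.0, cx=64.0, cy=48.0,
                         width=128, height=96)
print(visible)
```

Each returned pixel location could then be colored by its voxel's activation weight to produce an overlay like the one the caption describes.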