MOVE: A Simple Motion-Based Data Collection Paradigm for Spatial Generalization in Robotic Manipulation

Huanqian Wang, Chi Bene Chen, Yang Yue, Danhua Tao, Tong Guo, Shaoxuan Xie, Denghang Huang, Shiji Song, Guocai Yao, Gao Huang

TL;DR

The paper addresses the critical challenge of spatial generalization in robotic manipulation under data scarcity. It introduces MOVE, a motion-based data collection paradigm that augments demonstrations with dynamic translations, rotations, and camera motion to densely cover spatial configurations, trained within a diffusion-policy framework using DDIM. Across simulated Meta-World tasks and real-world experiments, MOVE consistently improves spatial generalization and data efficiency, achieving substantial relative gains over static data collection and matching longer static data budgets with shorter dynamic ones. Ablation studies confirm the value of combining multiple dynamic dimensions and show robustness to augmentation hyperparameters. The work suggests that dynamic data collection can meaningfully reduce data requirements for robust spatial generalization in robotics, with potential extensions to more complex tasks and viewpoints.

Abstract

Imitation learning has shown immense promise for robotic manipulation, yet its practical deployment is fundamentally constrained by data scarcity. Despite prior work on collecting large-scale datasets, a significant gap to robust spatial generalization remains. We identify a key limitation: individual trajectories, regardless of their length, are typically collected from a single, static spatial configuration of the environment. This includes fixed object and target positions as well as unchanging camera viewpoints, which significantly restricts the diversity of spatial information available for learning. To address this critical bottleneck in data efficiency, we propose MOtion-Based Variability Enhancement (MOVE), a simple yet effective data collection paradigm that enables the acquisition of richer spatial information from dynamic demonstrations. Our core contribution is an augmentation strategy that injects motion into any movable objects within the environment for each demonstration. This process implicitly generates a dense and diverse set of spatial configurations within a single trajectory. We conduct extensive experiments in both simulation and real-world environments to validate our approach. For example, in simulation tasks requiring strong spatial generalization, MOVE achieves an average success rate of 39.1%, a 76.1% relative improvement over the static data collection paradigm (22.2%), and yields up to 2–5× gains in data efficiency on certain tasks. Our code is available at https://github.com/lucywang720/MOVE.
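The relative improvement quoted in the abstract can be checked from the two reported success rates. The short sketch below does only that arithmetic; the function name `relative_improvement` is a hypothetical helper introduced here and is not taken from the MOVE codebase.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, expressed as a percentage."""
    return (new - baseline) / baseline * 100.0


# Average success rates (%) as reported in the abstract.
move_success = 39.1    # MOVE
static_success = 22.2  # static data collection

gain = relative_improvement(move_success, static_success)
print(f"{gain:.1f}%")  # prints 76.1%
```

The 16.9-point absolute gap divided by the 22.2% baseline gives roughly 0.761, which matches the stated 76.1% relative improvement.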

Paper Structure

This paper contains 27 sections, 4 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: We uniformly sample 10 trajectories from each of the 9 points across the entire space using both static data collection and MOVE. To ensure a fair comparison, we enforce that the grasping point of each MOVE trajectory corresponds to that of a static trajectory and that the total number of timesteps is the same. Despite this alignment, MOVE exhibits significantly better spatial generalization to unseen grasp points (29.5% vs. 80.8%).
  • Figure 2: An overview of the MOVE data collection paradigm. (Left) A conceptual comparison between the standard static data collection paradigm and MOVE. The former samples from discrete, fixed spatial configurations, where each trajectory represents a single point in the spatial configuration space. In contrast, each trajectory collected by MOVE is treated as a continuous segment, with objects, targets, and the camera in motion, resulting in a dense and diverse set of spatial configurations within a single trajectory. Therefore, with the same number of trajectories, our approach encodes broader spatial coverage and richer spatial information. (Right Top) In real-world environments, policies trained with data collected via MOVE demonstrate superior performance and generalization compared to the traditional data collection paradigm, with a maximum improvement of 4.0x on the normalized score. (Right Bottom) We demonstrate several forms of motion augmentation employed in MOVE, including translation, rotation and camera motion.
  • Figure 3: Generalization Comparison from Dense Sampling. To ensure a fair comparison, we enforce that the grasping point of each MOVE trajectory corresponds to that of a static trajectory. Despite being exposed to the same set of grasping positions during training, MOVE exhibits significantly better spatial generalization to unseen grasp points (66% vs. 74%).
  • Figure 4: Efficient scaling with demonstrations. Success rate across 10 simulation tasks. Specifically, the x-axis represents the number of timesteps, where each timestep corresponds to a single robot action, rather than the number of trajectories. MOVE consistently outperforms the static data collection paradigm at each data scaling point.
  • Figure 5: We uniformly sample grasping points evenly distributed on a circle and constrain the object's motion path within this circle for MOVE. While being exposed to the same grasping positions and constrained within the circle, MOVE exhibits significantly better spatial generalization on both the in-circle region (21% vs. 44%) and the out-of-circle region (45% vs. 67%).
  • ...and 4 more figures
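The Figure 2 caption contrasts a static trajectory, which represents a single point in configuration space, with a MOVE trajectory, which sweeps a continuous segment. A minimal sketch of that idea follows, under several assumptions that do not come from the paper: the object moves in 2D at constant velocity, only translation is modeled (no rotation or camera motion), and coverage is measured by counting distinct grid cells. All function names and parameter values are hypothetical.

```python
import numpy as np


def static_object_positions(start, num_steps):
    """Object stays at `start` for every timestep of the trajectory."""
    return np.tile(np.asarray(start, dtype=float), (num_steps, 1))


def moving_object_positions(start, velocity, num_steps, dt=0.05):
    """Object translates at constant `velocity` (an assumed motion model)."""
    t = np.arange(num_steps)[:, None] * dt  # elapsed time per step, shape (T, 1)
    return np.asarray(start, dtype=float) + t * np.asarray(velocity, dtype=float)


def occupied_cells(positions, cell_size=0.02):
    """Count distinct grid cells visited: a crude proxy for spatial coverage."""
    cells = np.floor(positions / cell_size).astype(int)
    return len({tuple(c) for c in cells})


num_steps = 100
start = [0.11, 0.21]  # arbitrary illustrative start position, in metres

static = static_object_positions(start, num_steps)
moving = moving_object_positions(start, velocity=[0.04, 0.02], num_steps=num_steps)

static_cells = occupied_cells(static)
moving_cells = occupied_cells(moving)
print(f"static trajectory covers {static_cells} cell(s)")
print(f"moving trajectory covers {moving_cells} cell(s)")
```

With the same number of timesteps, the static trajectory occupies one cell while the moving one visits several, which is the intuition behind the broader spatial coverage described in the caption. It says nothing about policy performance, which the paper evaluates experimentally.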