Is Diversity All You Need for Scalable Robotic Manipulation?

Modi Shi, Li Chen, Jin Chen, Yuxiang Lu, Chiming Liu, Guanghui Ren, Ping Luo, Di Huang, Maoqing Yao, Hongyang Li

TL;DR

This work examines how data diversity influences scalable robotic manipulation, challenging the intuition that more diverse data is always better. It shows that task diversity in pre-training yields stronger transfer than simply increasing per-task demonstrations, and that models pre-trained on a single embodiment can transfer across embodiments with favorable scaling properties during fine-tuning. It also shows that expert demonstration diversity introduces velocity multimodality that can confound policy learning, and it mitigates this with a velocity-based distribution debiasing method. The resulting model, GO-1-Pro, gains about 15% in performance, equivalent to using 2.5 times the pre-training data. Collectively, the results provide practical guidance for constructing scalable robotic manipulation datasets and training pipelines with improved data efficiency.

Abstract

Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates), challenging the conventional intuition of "more diverse is better". Through extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer: models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing a more desirable scaling property during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity. The resulting GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.
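
To make the velocity multimodality mentioned in the abstract concrete, here is a minimal illustrative sketch. It is not the paper's debiasing method. It assumes a toy 1-D setting in which two hypothetical experts follow the same straight-line path at different constant speeds; the function name `action_chunk`, the speeds, and the horizon are all made up for illustration.

```python
def action_chunk(start, speed, horizon):
    """Future waypoints for a 1-D straight-line motion at constant speed.

    Hypothetical helper: `speed` is displacement per control step and
    `horizon` is the number of future steps in the chunk.
    """
    return [start + speed * (t + 1) for t in range(horizon)]


# Two hypothetical experts start from the same state and follow the same path,
# but one moves four times faster than the other.
fast_chunk = action_chunk(start=0.0, speed=0.04, horizon=4)
slow_chunk = action_chunk(start=0.0, speed=0.01, horizon=4)

# A regressor trained with a squared-error loss on both demonstrations is
# pulled toward the element-wise mean of the two targets.
mean_chunk = [(f + s) / 2 for f, s in zip(fast_chunk, slow_chunk)]

print("fast:", fast_chunk)
print("slow:", slow_chunk)
print("mean:", mean_chunk)
```

The two chunks are different training targets for an identical observation, and their mean corresponds to a speed that neither expert demonstrated. This is one simple way to see why velocity variation across experts can act as ambiguity for chunk-based imitation learning.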

Paper Structure

This paper contains 31 sections, 3 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: We systematically investigate critical aspects of data diversity for robotic manipulation, i.e., task, embodiment, and expert diversity. Through comprehensive evaluation in simulation and the real world, we reveal key insights that challenge conventional assumptions on data scaling. (a) Task diversity benefits policy learning with predictable power-law scaling. (b) Multi-embodiment pre-training data is optional for cross-embodiment transfer: models pre-trained on single-embodiment data can efficiently adapt to different embodiments and show a more desirable scaling property during finetuning than multi-embodiment pre-trained models. (c) Expert diversity confuses robot learning. To address this, we devise a distribution debiasing method based on GO-1 [go1]; the resulting GO-1-Pro attains superior data efficiency during both pre-training and finetuning, achieving substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data.
  • Figure 2: Illustration of multimodal expert behavior in the Push-T task [chi2023diffusion]. The robot (blue circle) needs to move the gray T to the green target area. Expert demonstrations exhibit multimodality in both spatial and velocity dimensions: (a) Spatial multimodality arises from different trajectory choices, where the robot can approach the T from either the left or the right side, resulting in distinct spatial paths; (b) Velocity multimodality occurs when robots execute similar trajectories at different speeds, generating completely different demonstration profiles over time. In current action chunk-based imitation learning, models must learn both the spatial and the velocity multimodal distributions.
  • Figure 3: Distribution of atomic skills in two pre-training datasets. Task-based sampling (10% tasks) shows lower skill diversity but concentrates on the most commonly used skills, while episode-based sampling (10% episodes) demonstrates a more balanced distribution.
  • Figure 4: Real-robot evaluation of GO-1 [go1] on four challenging tasks after pre-training on different datasets. The tasks assess fine-grained manipulation, deformable object handling, long-horizon planning, and contact-rich interactions, respectively. Results show that episode-based sampling (10% Episode) outperforms task-based sampling (10% Task) by 0.1 in average score with the same data amount, and that performance improves consistently with increased pre-training data as long as sufficient task diversity is ensured.
  • Figure 5: Performance scales with pre-training data size while maintaining adequate task diversity, following a predictable power-law relationship. Left: GO-1 performance scales with pre-training data size. Right: Power-law relationship between pre-training data size and model performance. The dashed line represents a power-law fit with equation $y=1.24x^{-0.08}$ and correlation coefficient $r=-0.99$, indicating a strong adherence to power-law scaling with pre-training data volume.
  • ...and 9 more figures
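
The power-law fit quoted in the Figure 5 caption, $y=1.24x^{-0.08}$, can be explored with a short sketch. The captions above do not state the units of $x$ or exactly which quantity $y$ measures, so both are treated as abstract numbers here. The sample sizes in `xs` are hypothetical, and fitting by least squares in log-log space is an assumption about how such a fit might be obtained, not a description of the authors' procedure.

```python
import math


def power_law(x, a=1.24, b=-0.08):
    """Evaluate y = a * x**b with the coefficients quoted in the Figure 5 caption."""
    return a * x ** b


def fit_power_law(xs, ys):
    """Least-squares line in log-log space: log y = log a + b * log x.

    Returns (a, b, r), where r is the Pearson correlation of log x and log y.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    syy = sum((v - mean_y) ** 2 for v in ly)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


# Hypothetical data sizes; the y values are generated from the quoted curve,
# so the fit should recover its coefficients.
xs = [1, 10, 100, 1000]
ys = [power_law(x) for x in xs]
a, b, r = fit_power_law(xs, ys)
print(f"a={a:.2f}, b={b:.2f}, r={r:.2f}")

# Multiplicative change in y when x doubles under this curve.
doubling_factor = power_law(2) / power_law(1)
print(f"doubling factor: {doubling_factor:.3f}")
```

With an exponent of -0.08, doubling $x$ scales $y$ by $2^{-0.08} \approx 0.946$, so each doubling of data changes the fitted quantity by only about 5%. A negative exponent also means the log-log line slopes downward, which is consistent with the negative correlation coefficient in the caption.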