Efficient Reinforcement Learning by Guiding Generalist World Models with Non-Curated Data

Yi Zhao, Aidan Scannell, Wenshuai Zhao, Yuxin Hou, Tianyu Cui, Le Chen, Dieter Büchler, Arno Solin, Juho Kannala, Joni Pajarinen

TL;DR

This work addresses sample efficiency in offline-to-online RL by leveraging abundant non-curated, reward-free data collected across multiple embodiments. It identifies distributional shift during fine-tuning as a key bottleneck and introduces Generalist-to-Specialist Adaptation (GSA), which combines a multi-embodiment world model pre-trained on offline data with two mechanisms: experience rehearsal, which retrieves and replays task-relevant trajectories, and execution guidance, in which a prior actor steers exploration toward high-confidence regions. Empirically, GSA achieves a 102.8% relative improvement in aggregate score over training-from-scratch baselines at a modest online budget (150k samples) on 72 visuomotor tasks, and demonstrates fast continual adaptation on a multi-task Ant suite. Theoretical insights support the design: experience retrieval reduces distribution shift, and execution-guided imitation can provably improve early-stage performance. These results highlight the potential of non-curated offline data when it is properly integrated into both pre-training and fine-tuning, with implications for scalable, real-world robotic learning.
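
To make the retrieval step of experience rehearsal concrete, here is a minimal sketch of selecting task-relevant offline trajectories. It is not the paper's implementation: the function names, the use of mean feature vectors, and the Euclidean distance are all assumptions chosen for illustration, since this summary does not specify the retrieval criterion.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def retrieve_trajectories(offline_trajs, online_feats, k):
    """Return the k offline trajectories closest to the online data.

    offline_trajs: list of trajectories, each a list of feature vectors.
    online_feats: list of feature vectors observed in the online task.
    Assumption (hypothetical): closeness is the Euclidean distance between
    the mean feature of a trajectory and the mean feature of the online data.
    """
    query = mean_vector(online_feats)
    ranked = sorted(
        offline_trajs,
        key=lambda traj: math.dist(mean_vector(traj), query),
    )
    return ranked[:k]


# Toy offline pool: two trajectories resemble the online task, one does not.
offline = [
    [[0.0, 0.0], [0.2, 0.1]],
    [[5.0, 5.0], [5.1, 4.9]],  # far away, e.g. a different embodiment
    [[0.1, 0.3], [0.0, 0.2]],
]
online = [[0.1, 0.1], [0.0, 0.2]]

selected = retrieve_trajectories(offline, online, k=2)
print(selected)
```

The retrieved subset, rather than the full offline pool, would then be replayed alongside online data during fine-tuning, which is the intuition behind the claim that retrieval narrows the gap between the two distributions.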

Abstract

Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two essential techniques: i) experience rehearsal and ii) execution guidance. With these modifications, the non-curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves a 102.8% relative improvement in aggregate score over learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.
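
Execution guidance can be pictured as occasionally letting a prior actor choose the action during online data collection. The sketch below is one plausible reading, not the authors' algorithm: the per-step mixing probability `guide_prob`, the linear decay schedule, and all function names are assumptions introduced here for illustration.

```python
import random


def guided_action(obs, rl_policy, prior_actor, guide_prob, rng):
    """Return (action, source), deferring to the prior actor with
    probability guide_prob and to the RL policy otherwise."""
    if rng.random() < guide_prob:
        return prior_actor(obs), "prior"
    return rl_policy(obs), "rl"


def linear_decay(step, total_steps, start=0.5, end=0.0):
    """Hypothetical schedule: move guide_prob linearly from start to end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


# Stand-in policies; real ones would map observations to control actions.
def rl_policy(obs):
    return "rl_action"


def prior_actor(obs):
    return "prior_action"


rng = random.Random(0)
total_steps = 1000
counts = {"prior": 0, "rl": 0}
for step in range(total_steps):
    prob = linear_decay(step, total_steps)
    _, source = guided_action(None, rl_policy, prior_actor, prob, rng)
    counts[source] += 1
print(counts)
```

Decaying the probability reflects the idea that guidance matters most early in fine-tuning, when the RL policy is still poor; whether the paper uses a decay at all is not stated in this summary.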

Paper Structure

This paper contains 45 sections, 3 theorems, 22 equations, 12 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Experience retrieval reduces distribution shift during online fine-tuning, compared to using the full offline dataset directly, in the sense that

Figures (12)

  • Figure 1: Overview of GSA (Generalist-to-Specialist Adaptation). Given non-curated offline data that is reward-free, of mixed quality, and collected across multiple embodiments, we train a task- and embodiment-agnostic world model. Combined with experience rehearsal and execution guidance, the pre-trained world model improves the sample efficiency of RL training over a wide range of tasks.
  • Figure 2: Visualization of Distribution Mismatch. Left: At the early stage of fine-tuning, there is a distribution shift between offline data used for world model pre-training and online data used for RL fine-tuning, which hurts performance. Middle: Experience rehearsal mitigates the distributional shift issue. Right: Quantitatively, at the early stage of fine-tuning, experience rehearsal reduces the Wasserstein distance between the online data and both the offline and expert data.
  • Figure 3: Left: Quantitative comparison across 72 diverse tasks from Meta-World and DMControl. GSA achieves a 102.8% relative improvement in aggregate score over learning-from-scratch baselines when using the same sample budget (150k). It also matches the performance of baselines even when they are trained with substantially more samples (see the appendix for full results). Right: Learning curves on representative challenging locomotion and robotic manipulation tasks. GSA consistently outperforms state-of-the-art methods that leverage offline data by a decent margin. We plot the mean and corresponding 95% confidence interval.
  • Figure 4: Comparison with other world model pre-training methods. GSA outperforms state-of-the-art model-based methods without relying on techniques used in iVideoGPT, such as reward shaping and demonstration-based replay buffer initialization.
  • Figure 5: GSA enables fast task adaptation. We train an RL agent to control an Ant robot from DMControl to complete a series of tasks incrementally. GSA significantly outperforms the widely used baseline PackNet by properly leveraging non-curated offline data.
  • ...and 7 more figures
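
Figure 2 reports the distribution gap as a Wasserstein distance. As a rough illustration of what that quantity measures, the sketch below computes the 1-Wasserstein distance between two equal-size one-dimensional samples, where it reduces to the mean absolute difference of the sorted values. The paper most likely measures the distance on richer features; the function name and the toy numbers here are hypothetical and are not results from the paper.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D empirical samples.

    For equal-size samples, the optimal transport plan matches the i-th
    smallest value of xs to the i-th smallest value of ys.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Toy numbers only: online data compared with a full offline pool and with
# a retrieved subset that lies closer to the online task.
online_sample = [0.0, 1.0, 2.0]
full_offline_sample = [3.0, 4.0, 5.0]
retrieved_sample = [0.5, 1.5, 2.5]

gap_full = wasserstein_1d(online_sample, full_offline_sample)
gap_retrieved = wasserstein_1d(online_sample, retrieved_sample)
print(gap_full, gap_retrieved)
```

A smaller value for the retrieved subset than for the full pool is the pattern the figure describes for experience rehearsal.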

Theorems & Definitions (7)

  • Proposition 1
  • proof
  • Proposition 2
  • Definition 1: Catastrophic Forgetting due to Data Distribution Shift
  • proof
  • Proposition 3: Performance Improvement via Execution Guidance
  • proof