Generalizing from References using a Multi-Task Reference and Goal-Driven RL Framework

Jiashun Wang; M. Eva Mungai; He Li; Jean Pierre Sleiman; Jessica Hodgins; Farbod Farshidian

TL;DR

This paper introduces a unified multi-task RL framework that treats reference motion as a prior for behavioral shaping rather than a deployment-time constraint, bridging the gap between brittle reference-tracking policies and adaptable but less natural task-driven RL. It also demonstrates long-horizon behavior generation by composing multiple learned skills, illustrating the flexibility of the learned policies in complex scenarios.

Abstract

Learning agile humanoid behaviors from human motion offers a powerful route to natural, coordinated control, but existing approaches face a persistent trade-off: reference-tracking policies are often brittle outside the demonstration dataset, while purely task-driven Reinforcement Learning (RL) can achieve adaptability at the cost of motion quality. We introduce a unified multi-task RL framework that bridges this gap by treating reference motion as a prior for behavioral shaping rather than a deployment-time constraint. A single goal-conditioned policy is trained jointly on two tasks that share the same observation and action spaces, but differ in their initialization schemes, command spaces, and reward structures: (i) a reference-guided imitation task in which reference trajectories define dense imitation rewards but are not provided as policy inputs, and (ii) a goal-conditioned generalization task in which goals are sampled independently of any reference and where rewards reflect only task success. By co-optimizing these objectives within a shared formulation, the policy acquires structured, human-like motor skills from dense reference supervision while learning to adapt these skills to novel goals and initial conditions. This is achieved without adversarial objectives, explicit trajectory tracking, phase variables, or reference-dependent inference. We evaluate the method on a challenging box-based parkour playground that demands diverse athletic behaviors (e.g., jumping and climbing), and show that the learned controller transfers beyond the reference distribution while preserving motion naturalness. Finally, we demonstrate long-horizon behavior generation by composing multiple learned skills, illustrating the flexibility of the learned policies in complex scenarios.
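The two-task structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 2-D state, the names `Episode`, `sample_episode`, `policy_observation`, and `reward`, the mixing probability `p_imitation`, the exponential imitation kernel, and the choice of starting imitation episodes on the reference with its final frame as the goal are all assumptions made for illustration. What the sketch tries to preserve from the text is that both tasks share one observation layout, that the reference enters only through the reward, and that the generalization task is rewarded on success alone.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec = Tuple[float, float]  # toy 2-D state, an assumption for illustration


@dataclass
class Episode:
    task: str                       # "imitation" or "generalization"
    init_state: Vec                 # where the episode starts
    goal: Vec                       # goal command, the only task input the policy sees
    reference: Optional[List[Vec]]  # used on the reward side only


def sample_episode(rng: random.Random, references: List[List[Vec]],
                   p_imitation: float = 0.5) -> Episode:
    """Pick a task for one episode (the mixing rule is hypothetical)."""
    if rng.random() < p_imitation:
        ref = rng.choice(references)
        # Assumed scheme: start on the reference, use its last frame as the goal.
        return Episode("imitation", ref[0], ref[-1], list(ref))
    # Generalization: start and goal are sampled independently of any reference.
    init = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
    goal = (rng.uniform(1.0, 3.0), rng.uniform(0.0, 1.0))
    return Episode("generalization", init, goal, None)


def policy_observation(state: Vec, episode: Episode) -> Tuple[float, ...]:
    """Same layout for both tasks: state plus goal, never reference frames."""
    return (*state, *episode.goal)


def reward(state: Vec, t: int, episode: Episode,
           goal_tolerance: float = 0.1) -> float:
    if episode.task == "imitation":
        # Dense term: closeness to the reference frame at step t.
        frame = episode.reference[min(t, len(episode.reference) - 1)]
        return math.exp(-math.dist(state, frame) ** 2)
    # Sparse term: task success only.
    return 1.0 if math.dist(state, episode.goal) < goal_tolerance else 0.0
```

Because `policy_observation` never reads `episode.reference`, a policy trained on a mix of such episodes could be run at deployment with a goal alone, which is the property the abstract calls reference-independent inference.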

Paper Structure (25 sections, 10 equations, 4 figures, 8 tables)

Introduction
Methodology
Overview
Design Rationale
Training Setup and MDP Formulation
Evaluation
How Robust and Generalizable Are the Learned Policies in Simulation and on Hardware?
How Does Our Method Compare to Tabula Rasa RL and Pure Motion Imitation?
Can the Learned Skills Be Composed to Solve Long-Horizon Parkour Scenarios?
Which Components Are Necessary for the Pipeline to Work?
Conclusion
Observations and Goal Representation
Goal representation.
Goal definition relative to the box.
Reward Function
...and 10 more sections

Figures (4)

Figure 1: A humanoid robot performs human-like walking, jumping, and climbing behaviors in a box-based environment.
Figure 2: Success rate of our method under different initial conditions for walk-climb and walk-jump skills. When varying one initial condition, all other conditions are held at their nominal values. Orange markers show the nominal configuration of the initial state, while purple markers show the randomness level of the initialization during training. The gray rectangle represents the box with its edge positioned at 2.3 m.
Figure 3: Hardware experiments with varied initial conditions for the walk-climb, walk-jump, and climb-down skills. Each behavior is depicted from two different initial conditions. Despite changes in initial conditions, the robot adapts its strategy and successfully executes the skills.
Figure 4: Multi-skill composition in a sim-to-sim evaluation in MuJoCo. Learned policies are composed to execute walk-climb, walk-jump, and climb-down behaviors over long horizons.
