SteadyTray: Learning Object Balancing Tasks in Humanoid Tray Transport via Residual Reinforcement Learning

Anlun Huang, Zhenyu Wu, Soofiyan Atar, Yuheng Zhi, Michael Yip

TL;DR

ReST-RL is a hierarchical reinforcement learning architecture that explicitly decouples locomotion from payload stabilization. Evaluated on the SteadyTray benchmark, it demonstrates highly reliable zero-shot sim-to-real generalization across various objects and external force disturbances.

Abstract

Stabilizing unsecured payloads against the inherent oscillations of dynamic bipedal locomotion remains a critical engineering bottleneck for humanoids in unstructured environments. To solve this, we introduce ReST-RL, a hierarchical reinforcement learning architecture that explicitly decouples locomotion from payload stabilization, evaluated via the SteadyTray benchmark. Rather than relying on monolithic end-to-end learning, our framework integrates a robust base locomotion policy with a dynamic residual module engineered to actively cancel gait-induced perturbations at the end-effector. This architectural separation ensures steady tray transport without degrading the underlying bipedal stability. In simulation, the residual design significantly outperforms end-to-end baselines in gait smoothness and orientation accuracy, achieving a 96.9% success rate in variable velocity tracking and 74.5% robustness against external force disturbances. Successfully deployed on the Unitree G1 humanoid hardware, this modular approach demonstrates highly reliable zero-shot sim-to-real generalization across various objects and external force disturbances.

Paper Structure (28 sections, 4 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: ReST-RL enables a Unitree G1 humanoid to perform the SteadyTray task in a real-world setting, with a fluid-filled wine glass as one of the payloads. The robot keeps the tray level to prevent fluid sloshing, glass tipping, and payload falling during transport.
  • Figure 2: Overview of the ReST-RL framework. Base Policy Training: A locomotion policy is first trained to carry a tray while maintaining a stable gait. Residual Module Training: Using privileged observations, a residual module learns whole-body corrective adjustments on top of the frozen base policy to stabilize the payload under disturbances. Two residual designs are considered: (a) Residual Action Adapter, which adds corrective residual actions to the base action, and (b) Residual FiLM Adapter, which modulates intermediate activations of the frozen base policy via layer-wise FiLM residuals. The student encoder distillation process is shown in Fig. 3.
  • Figure 3: Residual module distillation. The teacher encoder uses privileged observations, whereas the student encoder uses object-centric inputs; both feed into a frozen residual adapter for latent alignment.
  • Figure 4: Training reward comparison between End2End and ReST-RL.
  • Figure 5: Success rate of ReST-RL trained with and without observation delay under increasing perception latency in the Push Robot task.
  • ...and 4 more figures
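
The two residual designs named in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The layer sizes, the single-hidden-layer base policy, the linear residual heads, and the choice to apply the modulation before the activation as h' = (1 + Δγ)·h + Δβ are all assumptions made for illustration; only the general ideas (a frozen base policy, an additive action residual, and FiLM-style modulation of intermediate activations) come from the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for this sketch.
OBS_DIM, HIDDEN_DIM, ACT_DIM, PRIV_DIM = 8, 16, 4, 6

# Frozen base policy: a one-hidden-layer MLP whose weights are never updated.
W1 = rng.normal(size=(HIDDEN_DIM, OBS_DIM)) * 0.1
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(size=(ACT_DIM, HIDDEN_DIM)) * 0.1
b2 = np.zeros(ACT_DIM)


def base_policy(obs, film=None):
    """Base action; optionally modulate the hidden pre-activation."""
    h = W1 @ obs + b1
    if film is not None:
        d_gamma, d_beta = film
        # Assumed residual FiLM form: identity scaling plus learned offsets.
        h = (1.0 + d_gamma) * h + d_beta
    return W2 @ np.tanh(h) + b2


def residual_action_adapter(obs, priv, Wr):
    """(a) Add a corrective action (here a linear head) to the base action."""
    return base_policy(obs) + Wr @ priv


def residual_film_adapter(obs, priv, Wg, Wb):
    """(b) Modulate the frozen policy's hidden layer from privileged inputs."""
    return base_policy(obs, film=(Wg @ priv, Wb @ priv))


obs = np.ones(OBS_DIM)
priv = np.ones(PRIV_DIM)
a_base = base_policy(obs)
a_action = residual_action_adapter(obs, priv, np.zeros((ACT_DIM, PRIV_DIM)))
a_film = residual_film_adapter(
    obs, priv, np.zeros((HIDDEN_DIM, PRIV_DIM)), np.zeros((HIDDEN_DIM, PRIV_DIM))
)
```

With all residual weights set to zero, both adapters return exactly the base action, which is one reason residual formulations are convenient: training can start from the behavior of the frozen base policy and learn only the correction.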