BAT: Balancing Agility and Stability via Online Policy Switching for Long-Horizon Whole-Body Humanoid Control

Donghoon Baek, Sang-Hun Kim, Sehoon Ha

Abstract

Despite recent advances in control, reinforcement learning, and imitation learning, developing a unified framework that can achieve agile, precise, and robust whole-body behaviors, particularly in long-horizon tasks, remains challenging. Existing approaches typically follow two paradigms: coupled whole-body policies for global coordination and decoupled policies for modular precision. However, without a systematic method to integrate both, the trade-off between agility, robustness, and precision remains unresolved. In this work, we propose BAT, an online policy-switching framework that dynamically selects between two complementary whole-body RL controllers to balance agility and stability across different motion contexts. Our framework consists of two complementary modules: a switching policy learned via hierarchical RL with expert guidance from sliding-horizon policy pre-evaluation, and an option-aware VQ-VAE that predicts option preference from discrete motion token sequences for improved generalization. The final decision is obtained via confidence-weighted fusion of the two modules. Extensive simulations and real-world experiments on the Unitree G1 humanoid robot demonstrate that BAT enables versatile long-horizon loco-manipulation and outperforms prior methods across diverse tasks.
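For concreteness, the confidence-weighted fusion step can be sketched as follows, assuming each module exposes a probability over the two options $\{\pi_D, \pi_C\}$ together with a scalar confidence; the function and variable names are illustrative, not the paper's API.

```python
# Minimal sketch of confidence-weighted decision fusion. Each module is
# assumed to output an option distribution over {pi_D, pi_C} plus a scalar
# confidence; fuse_decisions, p_switch, p_token, etc. are hypothetical names.
import numpy as np

def fuse_decisions(p_switch, c_switch, p_token, c_token):
    """Return the fused option index (0: decoupled pi_D, 1: coupled pi_C)."""
    w = np.array([c_switch, c_token], dtype=float)
    w /= w.sum()                     # normalize confidences into weights
    p_fused = w[0] * np.asarray(p_switch) + w[1] * np.asarray(p_token)
    return int(np.argmax(p_fused))

# The switching policy mildly prefers pi_C, but the more confident
# option-aware VQ-VAE prefers pi_D, so the fused decision is pi_D.
print(fuse_decisions([0.45, 0.55], 0.3, [0.80, 0.20], 0.9))  # -> 0
```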

Paper Structure

This paper contains 29 sections, 26 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Conceptual overview of BAT. The decoupled policy (blue) provides stable and disturbance-robust behaviors, while the coupled policy (red) enables agile and dynamic motions. BAT adaptively switches between them to achieve both stability and agility.
  • Figure 2: Analysis of switching behavior and motion robustness. (a) t-SNE visualization of motion outcomes. Since robustness (success vs. failure) is prioritized, policy $\pi_D$ achieves substantially more successful executions than $\pi_C$. (b) Success distribution across dynamic motion levels. Both methods perform well at low dynamic levels, while $\pi_C$ achieves relatively more successes as the motion becomes more dynamic. (c) Qualitative comparisons across three scenarios: (1) external push recovery, (2) rough terrain walking, and (3) static and dynamic motion execution including squatting, running, jumping, and kicking (blue: $\pi_D$, red: $\pi_C$).
  • Figure 3: Overview of BAT. (1) Option-Aware VQ-VAE learns a discrete motion representation via codebooks, jointly trained with next-token prediction, reconstruction, and option prediction objectives. The resulting option-aware latent tokens serve directly as input to the option prediction module. (2) Offline Data Construction applies sliding-horizon option evaluation over retargeted motion data from two policies ($\pi_D$, $\pi_C$), generating high-quality switching demonstration data $\mathcal{D}_{Op}$ via motion blending with inertialization (a minimal sketch of this pre-evaluation appears after the figure list). (3) Option-Guided Hierarchical RL trains a high-level switching policy that selects between $\pi_D$ and $\pi_C$, executed by the low-level policy manager. Learning is bootstrapped from $\mathcal{D}_{Op}$ via BC-guided exploration for sample-efficient training. (4) Decision Fusion Module integrates all three modules, leveraging their complementary uncertainty characteristics for decision-making.
  • Figure 4: Controller-specific token sequence distributions. Each point represents a token sequence plotted by $P(\mathrm{seq}\mid\pi_D)$ and $P(\mathrm{seq}\mid\pi_C)$, where marker size denotes sequence frequency. Colors indicate $\pi_D$-only, $\pi_C$-only, and shared sequences (numbers show counts). Left: vanilla VQ-VAE produces many shared sequences, indicating that the learned tokens are not strongly aligned with controller preference. Right: option-aware VQ-VAE yields clearer controller-specific token separation, demonstrating that the token space better captures controller-dependent motion structure relevant for switching (a toy computation of these per-controller sequence probabilities is sketched after the figure list).
  • Figure 5: Comparison of our method (BAT) with various switching strategies in simulation. Top: training set (in-distribution, 400 motion combinations). Bottom: test set (unseen transitions, 100 combinations). BAT denotes the full method. Opt-HRL uses option guidance with HRL, while Opt-HRL-i trains on individual motions without transition exposure. Opt-BC uses option guidance with behavior cloning only (no HRL), and HRL corresponds to unguided hierarchical RL. Additional baselines include GT-Opt (oracle option selection), Opt-Pred (option prediction), Heur (human heuristics), and fixed decoupled ($\pi_D$) and coupled ($\pi_C$) policies.
  • ...and 2 more figures
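To make the offline data construction of Figure 3 concrete, below is a minimal sketch of sliding-horizon option pre-evaluation under stated assumptions: a `rollout_score(policy, motion, t, horizon)` oracle that rolls one controller out in simulation over a window and returns a scalar score (e.g., a tracking reward). All names are hypothetical placeholders, not the paper's implementation.

```python
# Hypothetical sketch: slide a fixed horizon over a reference motion, score
# both controllers on each window, and keep the better one as the expert
# option label. rollout_score is an assumed simulation oracle.
def label_options(motion, rollout_score, horizon=50, stride=10):
    """Return (start_frame, option) pairs; option 0 = pi_D, 1 = pi_C."""
    demos = []
    for t in range(0, len(motion) - horizon + 1, stride):
        score_d = rollout_score("pi_D", motion, t, horizon)
        score_c = rollout_score("pi_C", motion, t, horizon)
        demos.append((t, int(score_c > score_d)))
    return demos  # switching demonstrations for the D_Op dataset
```

Per the Figure 3 caption, the resulting demonstrations are additionally smoothed via motion blending with inertialization before being stored in $\mathcal{D}_{Op}$; that step is omitted here.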
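Similarly, the per-controller statistics plotted in Figure 4 can be reproduced in toy form: estimate the empirical probabilities $P(\mathrm{seq}\mid\pi_D)$ and $P(\mathrm{seq}\mid\pi_C)$ from tokenized rollouts of each controller, then partition sequences into $\pi_D$-only, $\pi_C$-only, and shared sets. The snippet below is a self-contained toy, not the paper's analysis code.

```python
# Toy sketch of the Figure 4 statistics: empirical P(seq | controller) from
# tokenized motion data, then a split into pi_D-only / pi_C-only / shared.
from collections import Counter

def seq_probs(token_seqs):
    """Empirical probability of each token sequence (as a tuple) in a corpus."""
    counts = Counter(map(tuple, token_seqs))
    total = sum(counts.values())
    return {seq: c / total for seq, c in counts.items()}

def split_by_controller(seqs_d, seqs_c):
    p_d, p_c = seq_probs(seqs_d), seq_probs(seqs_c)
    shared = set(p_d) & set(p_c)
    return set(p_d) - shared, set(p_c) - shared, shared

d_only, c_only, shared = split_by_controller(
    [[3, 7, 7], [3, 7, 7], [1, 2, 5]],  # token sequences under pi_D
    [[9, 4, 4], [1, 2, 5]],             # token sequences under pi_C
)
print(len(d_only), len(c_only), len(shared))  # -> 1 1 1
```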