Learning Emergent Gaits with Decentralized Phase Oscillators: on the role of Observations, Rewards, and Feedback

Jenny Zhang, Steve Heim, Se Hwan Jeon, Sangbae Kim

TL;DR

The paper introduces a minimal quadruped locomotion framework based on four decentralized phase oscillators, one per leg. Each oscillator receives local ground reaction force (GRF) feedback, which acts as an observer feedback term for estimating whether its leg should be in stance or swing. Phase observations and phase-based gait rewards enable emergent gait preferences without prescribing a fixed gait, and coupling the oscillator dynamics to the GRF further accelerates convergence and enables adaptation to perturbations. Ablations show that combining all three signals yields balanced leg use, robust disturbance rejection, and faster gait emergence, while rewards can strongly influence stability even when observations are non-Markovian. The method offers a scalable route toward gait emergence with potential benefits for sim-to-real transfer and hierarchical RL, where the phase oscillators serve as a latent cyclic state for temporal abstraction and coordination.

Abstract

We present a minimal phase oscillator model for learning quadrupedal locomotion. Each of the four oscillators is coupled only to itself and its corresponding leg through local feedback of the ground reaction force, which can be interpreted as an observer feedback gain. We interpret the oscillator itself as a latent contact state-estimator. Through a systematic ablation study, we show that the combination of phase observations, simple phase-based rewards, and the local feedback dynamics induces policies that exhibit emergent gait preferences, while using a reduced set of simple rewards and without prescribing a specific gait. The code is open-source, and a video synopsis is available at https://youtu.be/1NKQ0rSV3jU.

Paper Structure (12 sections, 5 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: We augment the robot state with four decentralized phase oscillators, one per leg. Blue arrows in the diagram indicate the three oscillator-related signals. First, the oscillator phases are passed to the policy as observations, which makes feed-forward behavior Markov. Second, the phase-based reward encodes the general properties of gaits. Third, the ground reaction force ($F^{\text{GRF}}$) is fed back to the oscillators; we view this as observer feedback, which allows us to interpret the phase oscillators as a state observer of whether each foot should be in stance or swing. The scissors represent our ablation study.
  • Figure 2: To illustrate the oscillator fixed points (where $\dot{\phi}=0$), we graph the oscillator dynamics with $F^{\text{GRF}}=0.25$, where each leg supports a quarter of the body weight. For the curve with coupling $\sigma=4$ and offset $\xi=0$, the point at phase $\phi=2\pi$ is only marginally stable, and would not settle in stance. When $\xi=0$, the limit of the fixed point as $\sigma$ approaches $+\infty$ is $3\pi/2$, but drastically increasing $\sigma$ alone introduces discrete jumps in $\phi$ that are destabilizing. Setting $\xi=1$ with $\sigma=4$ caps $\dot{\phi}$ at the nominal $2\pi\omega$, and places the fixed point directly in the middle of the stance phase. This formulation helps to achieve standing without stepping in place.
  • Figure 3: Each ORC configuration has 500 agents (50 per re-trained policy), and $F^{\text{GRF}}$ is averaged for each leg across the entire episode. ORC(11x) policies show much more consistent and balanced leg use than all other configurations, which tend to exhibit two- or three-legged gaits.
  • Figure 4: Initial and final relative phase difference (RPD) points are shown for 500 randomly initialized runs in each experiment. ORC(111) evaluated with coupling $\sigma=0$ tracks the initial phases and cannot converge to any specific gait. ORC(111) evaluated with $\sigma=1$ exhibits strong convergence to trot and pronk, while ORC(110) evaluated with $\sigma=1$ exhibits some convergence around trot, but its final RPD points are more spread out than those of ORC(111).
  • Figure 5: The distribution of gaits for 500 runs of ORC(111) with $\sigma=1$ settles quickly, within 10 seconds, into both trot and bound.
  • ...and 2 more figures
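
The fixed-point behavior described in the Figure 2 caption can be checked with a short sketch. The functional form used below, $\dot{\phi} = 2\pi\omega\,(1 - \sigma F^{\text{GRF}}(\cos\phi + \xi))$, is an assumption: it was chosen only because it reproduces the properties stated in the caption, and it is not taken from the paper, whose equations are not reproduced on this page. The function names `phase_rate` and `stable_fixed_point` are likewise hypothetical.

```python
import math


def phase_rate(phi, grf, sigma, xi, omega=1.0):
    """Assumed oscillator dynamics (illustrative, not the paper's equation):
    dphi/dt = 2*pi*omega * (1 - sigma*grf*(cos(phi) + xi))."""
    return 2.0 * math.pi * omega * (1.0 - sigma * grf * (math.cos(phi) + xi))


def stable_fixed_point(grf, sigma, xi):
    """Return (phi, kind) for the zero of the assumed dynamics in [0, 2*pi).

    kind is "stable", "marginal" (the rate only touches zero), or "none".
    """
    gain = sigma * grf
    if gain < 0.0:
        raise ValueError("this sketch only covers non-negative sigma * grf")
    if gain == 0.0:
        return None, "none"  # constant rate 2*pi*omega, so no fixed point
    c = 1.0 / gain - xi  # zeros of the rate satisfy cos(phi) = c
    if abs(c) > 1.0:
        return None, "none"
    if abs(c) == 1.0:
        # Tangent zero: phi = 0 (equivalently 2*pi) for c = 1, phi = pi for c = -1.
        return (0.0 if c > 0 else math.pi), "marginal"
    # d(rate)/d(phi) = 2*pi*omega*gain*sin(phi), so the stable root has sin(phi) < 0.
    return 2.0 * math.pi - math.acos(c), "stable"


if __name__ == "__main__":
    # grf = 0.25 mirrors the caption: each leg carries a quarter of the body weight.
    for sigma, xi in [(4.0, 0.0), (4.0, 1.0), (1e6, 0.0)]:
        phi, kind = stable_fixed_point(0.25, sigma, xi)
        print(f"sigma={sigma:g}, xi={xi:g}: phi={phi}, kind={kind}")
```

Under this assumed form, $\sigma=4$, $\xi=1$, $F^{\text{GRF}}=0.25$ gives a stable fixed point at $3\pi/2$ and a peak rate of $2\pi\omega$, while $\xi=0$ with the same $\sigma$ leaves only a marginal zero at $\phi=0$ (equivalently $2\pi$). Increasing $\sigma$ with $\xi=0$ moves the stable fixed point toward $3\pi/2$.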