HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration

Hao Zhang, Yaru Niu, Yikai Wang, Ding Zhao, H. Eric Tseng

TL;DR

Heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric, improves generalization and robustness in collaborative corner cases.

Abstract

To improve generalization and resilience in human-robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process: a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of the RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.
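The phrase "optimal quadratic projections" can be pictured with a generic half-space projection. The sketch below is an illustration under stated assumptions, not the paper's update law: it assumes the Lyapunov decrease condition can be written as a single linear constraint $\langle h, d \rangle \le -\sigma V$, where `g` is the raw stacked decentralized gradient direction, `h` is the gradient of the disagreement metric $V$, and `sigma` is a decay rate. The function name `project_direction` and the exact constraint form are hypothetical.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_direction(g: List[float], h: List[float], v: float, sigma: float) -> List[float]:
    """Return the direction closest to g (in Euclidean norm) with dot(h, d) <= -sigma * v.

    This solves  min_d ||d - g||^2  s.t.  dot(h, d) <= -sigma * v  in closed form.
    The constraint form is an assumption made for illustration only.
    """
    target = -sigma * v
    violation = dot(h, g) - target  # positive when g breaks the constraint
    h_sq = dot(h, h)
    if violation <= 0.0 or h_sq == 0.0:
        # Either g is already feasible, or h is zero and no correction along h exists.
        return list(g)
    scale = violation / h_sq
    # Remove just enough of the component along h to land on the constraint boundary.
    return [gi - scale * hi for gi, hi in zip(g, h)]


# Toy example with made-up numbers: the raw direction g increases V to first order,
# so the projection bends it until the decrease condition holds with equality.
g = [1.0, 0.0]
h = [1.0, 1.0]
d_star = project_direction(g, h, v=0.5, sigma=0.2)
print(d_star, dot(h, d_star))
```

When the raw direction already satisfies the constraint it is returned unchanged, so a correction of this kind would only intervene on steps that would otherwise increase the disagreement metric.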

Paper Structure

This paper contains 29 sections, 2 theorems, 22 equations, 8 figures, 10 tables, and 1 algorithm.

Key Result

Theorem 5.2

Under the smoothness assumption, let $\{ \bm{\theta}_k \}_{k=0}^\infty$ be the sequence of parameters generated by the HALyPO update law. If the learning rate $\eta$ satisfies the stability bound $\eta \le 2\sigma V(\bm{\theta}_k) / (L \| \mathbf{d}^*_k \|_2^2)$, then the rationality gap $V(\bm{\the

Figures (8)

  • Figure 1: The HALyPO framework architecture, showing the transition from standard decentralized learning to Lyapunov policy optimization for real-world HRC. Key components include the computation of the rationality gap $V(\theta)$ and the stability normal vector $h$, from which the final closed-form projection $d^*$ is derived.
  • Figure 2: Simulation benchmark and learning dynamics: (a) massively parallelized training infrastructure in Isaac Lab, where the arrows indicate the emergent synergistic collaboration; (b) performance comparison across nine scenarios, where HALyPO converges significantly faster, reaching its performance plateau at approximately 1.3B steps.
  • Figure 3: Comparison of HALyPO and baseline MARL algorithms across the nine scenarios in OSP, SCT and SLH tasks.
  • Figure 4: Optimization dynamics analysis: (a) monotonic dissipation of $V(\bm{\theta})$ under the Lyapunov stability certificate; (b) rapid convergence of gradient alignment. HALyPO eliminates solenoidal components to stabilize the joint parameter manifold.
  • Figure 5: Scalability and algorithm metrics analysis: (a) convergence steps required to reach the performance plateau; (b) steady-state rationality gap $V$; (c) final gradient alignment $\cos \phi$; (d) gradient conflict rate across algorithms.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Theorem 5.2: Monotonicity of potential decay
  • proof : Summary
  • Theorem 5.3: Convergence to the synergy manifold
  • proof : Summary
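
The step-size bound in Theorem 5.2 has the shape of a standard descent-lemma argument. The derivation below is a sketch of that argument under assumptions that are not confirmed by the text above: that $V$ is $L$-smooth, that the update is $\bm{\theta}_{k+1} = \bm{\theta}_k + \eta\, \mathbf{d}^*_k$, and that the projected direction satisfies $\langle \nabla V(\bm{\theta}_k), \mathbf{d}^*_k \rangle \le -\sigma V(\bm{\theta}_k)$. The paper's own proof may proceed differently.

```latex
\begin{align*}
V(\bm{\theta}_{k+1})
  &\le V(\bm{\theta}_k)
     + \eta \,\langle \nabla V(\bm{\theta}_k), \mathbf{d}^*_k \rangle
     + \tfrac{L}{2}\,\eta^2 \,\| \mathbf{d}^*_k \|_2^2 \\
  &\le V(\bm{\theta}_k)
     - \eta\,\sigma\, V(\bm{\theta}_k)
     + \tfrac{L}{2}\,\eta^2 \,\| \mathbf{d}^*_k \|_2^2 .
\end{align*}
% For \eta > 0, the last two terms sum to a non-positive quantity exactly when
\[
\eta \;\le\; \frac{2\sigma V(\bm{\theta}_k)}{L \,\| \mathbf{d}^*_k \|_2^2},
\]
% which gives V(\bm{\theta}_{k+1}) \le V(\bm{\theta}_k).
```

The first inequality is the usual quadratic upper bound for an $L$-smooth function; the second substitutes the assumed decrease condition on the direction. Under these assumptions, the bound on $\eta$ is exactly the condition that the second-order term does not outweigh the first-order decrease.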