HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration

Hao Zhang; Yaru Niu; Yikai Wang; Ding Zhao; H. Eric Tseng

TL;DR

Heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric, improves generalization and robustness in collaborative corner cases.

Abstract

To improve generalization and resilience in human-robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process: a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.
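
The "optimal quadratic projection" mentioned in the abstract can be illustrated with a minimal sketch. It assumes, and this is not stated in the text above, that the per-step Lyapunov decrease condition is a single linear constraint $h^\top d \le -\sigma V$ on the update direction $d$, where $h$ stands in for the gradient of the disagreement metric $V$, and that the rectified direction is the Euclidean projection of the raw decentralized gradient $g$ onto that half-space. The function names and toy numbers are hypothetical, not the authors' implementation.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def project_to_decrease(g, h, V, sigma):
    """Return the direction closest to g (in Euclidean norm) with h.d <= -sigma*V.

    Assumed form of the constraint; h, V and sigma are illustrative stand-ins.
    """
    hh = dot(h, h)
    violation = dot(h, g) + sigma * V
    # If g already satisfies the constraint, leave it unchanged.
    # If h is the zero vector, no correction along h is possible, so return g.
    if violation <= 0.0 or hh == 0.0:
        return list(g)
    # Otherwise remove just enough of the component along h to land on the
    # constraint boundary (closed-form half-space projection).
    scale = violation / hh
    return [gi - scale * hi for gi, hi in zip(g, h)]


if __name__ == "__main__":
    g = [1.0, 0.0]      # raw joint update direction (toy values)
    h = [1.0, 1.0]      # assumed gradient of V at the current parameters
    V, sigma = 0.5, 0.2
    d = project_to_decrease(g, h, V, sigma)
    print("rectified direction:", d)
    print("h.d + sigma*V (about 0 on the boundary):", dot(h, d) + sigma * V)
```

Because the constraint is a single half-space, the projection has a closed form and needs no iterative solver; directions that already satisfy the decrease condition pass through untouched.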

Paper Structure (29 sections, 2 theorems, 22 equations, 8 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Decentralized POMDPs
Decoupled CTDE and the stationarity assumption
Learning dynamics and rationality gap
Methodology: The HALyPO Framework
Vector field misalignment and Lyapunov stability
Structural stability and analytic projection
Scalability via Hessian-vector product
Theoretical Analysis
Monotonic descent of the rationality gap
Asymptotic convergence to equilibrium
Experiments and Results
Experimental setup
...and 14 more sections

Key Result

Theorem 5.2

Under the smoothness assumption, let $\{ \bm{\theta}_k \}_{k=0}^\infty$ be the sequence of parameters generated by the HALyPO update law. If the learning rate $\eta$ satisfies the stability bound $\eta \le 2\sigma V(\bm{\theta}_k) / (L \| \mathbf{d}^*_k \|_2^2)$, then the rationality gap $V(\bm{\theta}_k)$ …
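
A bound of this form is what the standard descent-lemma argument produces. The sketch below is not necessarily the authors' proof and rests on three assumptions not confirmed by the text above: the update is $\bm{\theta}_{k+1} = \bm{\theta}_k + \eta\, \mathbf{d}^*_k$, the gradient $\nabla V$ is $L$-Lipschitz, and the projected direction satisfies $\nabla V(\bm{\theta}_k)^\top \mathbf{d}^*_k \le -\sigma V(\bm{\theta}_k)$.

$$
\begin{aligned}
V(\bm{\theta}_{k+1})
&\le V(\bm{\theta}_k) + \eta\, \nabla V(\bm{\theta}_k)^\top \mathbf{d}^*_k + \frac{L}{2}\, \eta^2 \| \mathbf{d}^*_k \|_2^2 \\
&\le V(\bm{\theta}_k) - \eta\, \sigma V(\bm{\theta}_k) + \frac{L}{2}\, \eta^2 \| \mathbf{d}^*_k \|_2^2 .
\end{aligned}
$$

The right-hand side does not exceed $V(\bm{\theta}_k)$ exactly when $\frac{L}{2}\, \eta\, \| \mathbf{d}^*_k \|_2^2 \le \sigma V(\bm{\theta}_k)$, that is, when $\eta \le 2\sigma V(\bm{\theta}_k) / (L \| \mathbf{d}^*_k \|_2^2)$. Under these assumptions, any step size within the bound cannot increase $V$, and any strictly smaller positive step size strictly decreases it whenever $V(\bm{\theta}_k) > 0$.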

Figures (8)

Figure 1: The HALyPO framework architecture, showing the transition from standard decentralized learning to Lyapunov policy optimization for real-world HRC. Key components include the computation of the rationality gap $V(\theta)$ and the stability normal vector $h$, from which the final analytic closed-form projection $d^*$ is derived.
Figure 2: Simulation benchmark and learning dynamics: (a) massively parallelized training infrastructure in Isaac Lab, where the arrows indicate the emergent synergistic collaboration; (b) performance comparison across nine scenarios, where HALyPO demonstrates significantly faster convergence, reaching its performance plateau at approximately 1.3B steps.
Figure 3: Comparison of HALyPO and baseline MARL algorithms across the nine scenarios in OSP, SCT and SLH tasks.
Figure 4: Optimization dynamics analysis: (a) monotonic dissipation of $V(\bm{\theta})$ under the Lyapunov stability certificate; (b) rapid convergence of gradient alignment. HALyPO eliminates solenoidal components to stabilize the joint parameter manifold.
Figure 5: Scalability and algorithm metrics analysis: (a) convergence steps required to reach performance plateau; (b) steady-state rationality gap $V$; (c) final gradient alignment $\cos \phi$; (d) gradient conflict rate across algorithms.
...and 3 more figures

Theorems & Definitions (4)

Theorem 5.2: Monotonicity of potential decay
Proof: Summary
Theorem 5.3: Convergence to the synergy manifold
Proof: Summary
