
Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity

Yan Yang, Bin Gao, Ya-xiang Yuan

TL;DR

This work develops a fully first-order framework for bilevel reinforcement learning by deriving the hyper-gradient without assuming lower-level convexity, using the fixed-point structure of entropy-regularized RL. It introduces model-based (M-SoBiRL) and model-free (SoBiRL) algorithms and extends the model-free algorithm to the stochastic setting (Stoc-SoBiRL). The convergence guarantees are $\mathcal{O}(\epsilon^{-1})$ for the deterministic methods, and $\widetilde{\mathcal{O}}(\epsilon^{-1.5})$ outer iterations with $\widetilde{\mathcal{O}}(\epsilon^{-3.5})$ samples for the stochastic variant. The hyper-gradient combines exploitation and exploration, enabling joint optimization of reward shaping and RLHF-style objectives while requiring only first-order oracles. Empirical results on RLHF tasks (Atari, MuJoCo) and a synthetic BiRL problem corroborate the effectiveness and scalability of the approach.

Abstract

Bilevel reinforcement learning (RL), which features intertwined two-level problems, has attracted growing interest recently. The inherent non-convexity of the lower-level RL problem is, however, an impediment to developing bilevel optimization methods. By employing the fixed-point equation associated with regularized RL, we characterize the hyper-gradient via fully first-order information, thus circumventing the assumption of lower-level convexity. This, remarkably, distinguishes our development of the hyper-gradient from general AID-based bilevel frameworks, since we take advantage of the specific structure of RL problems. Moreover, we design both model-based and model-free bilevel reinforcement learning algorithms, facilitated by access to the fully first-order hyper-gradient. Both algorithms enjoy the convergence rate $\mathcal{O}(\epsilon^{-1})$. To extend the applicability, a stochastic version of the model-free algorithm is proposed, along with results on its iteration and sample complexity. In addition, numerical experiments demonstrate that the hyper-gradient indeed serves as an integration of exploitation and exploration.
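The fixed-point idea in the abstract can be illustrated on a small tabular problem. The sketch below is not the paper's algorithm: it assumes a hypothetical soft Bellman operator $\varphi(x,v)(s)=\tau\log\sum_a\exp\big((r_x(s,a)+\gamma\sum_{s'}P(s'|s,a)v(s'))/\tau\big)$ with a tabular reward parameter $x$ (one entry per state-action pair), and all names (`soft_bellman`, `solve_value`, `value_jacobian`) are made up for illustration. It differentiates the fixed point $V^*(x)=\varphi(x,V^*(x))$ implicitly, using only first-order derivatives of $\varphi$, and compares the result with finite differences.

```python
import numpy as np

def soft_bellman(r, P, v, gamma, tau):
    """Assumed soft Bellman operator: tau * logsumexp(q / tau) per state."""
    q = r + gamma * (P @ v)                      # (S, A) soft Q-values
    m = q.max(axis=1)                            # shift for numerical stability
    return m + tau * np.log(np.exp((q - m[:, None]) / tau).sum(axis=1))

def solve_value(r, P, gamma, tau, tol=1e-13, max_iter=10_000):
    """Fixed-point iteration v <- phi(x, v) until successive iterates agree."""
    v = np.zeros(r.shape[0])
    for _ in range(max_iter):
        v_next = soft_bellman(r, P, v, gamma, tau)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

def value_jacobian(r, P, gamma, tau):
    """Implicit differentiation: dV*/dx = (I - d_v phi)^{-1} d_x phi."""
    S, A = r.shape
    v = solve_value(r, P, gamma, tau)
    q = r + gamma * (P @ v)
    w = np.exp((q - q.max(axis=1, keepdims=True)) / tau)
    pi = w / w.sum(axis=1, keepdims=True)        # softmax policy at the fixed point
    P_pi = np.einsum("sa,sat->st", pi, P)        # policy-induced transition matrix
    d_phi_dx = np.zeros((S, S * A))              # d phi_s / d x_(s,a) = pi(a|s)
    for s in range(S):
        d_phi_dx[s, s * A:(s + 1) * A] = pi[s]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, d_phi_dx)

# Small random problem (all sizes and constants are arbitrary choices).
rng = np.random.default_rng(0)
S, A, gamma, tau = 4, 3, 0.9, 0.5
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)                # rows are probability vectors
x = rng.normal(size=S * A)                       # tabular reward parameter

J_implicit = value_jacobian(x.reshape(S, A), P, gamma, tau)

# Central finite differences on V*(x) as an independent check.
eps = 1e-5
J_fd = np.zeros((S, S * A))
for k in range(S * A):
    e = np.zeros(S * A)
    e[k] = eps
    v_plus = solve_value((x + e).reshape(S, A), P, gamma, tau)
    v_minus = solve_value((x - e).reshape(S, A), P, gamma, tau)
    J_fd[:, k] = (v_plus - v_minus) / (2 * eps)

max_gap = np.max(np.abs(J_implicit - J_fd))
```

In this toy setting the linear solve uses the transition model directly, so it is closest in spirit to a model-based computation; no second-order derivative of the lower-level objective is formed at any point.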

Paper Structure

This paper contains 38 sections, 44 theorems, 271 equations, 4 figures, 2 tables, and 5 algorithms.

Key Result

Proposition 4.1

For any $x\in\mathbb{R}^n$, $\varphi(x,\cdot)$ is a contraction mapping, i.e., $\|\nabla_v \varphi(x,v)\|_{\infty}=\gamma<1$, and the matrix $I-\nabla_{v}\varphi(x,v)$ is invertible. Consequently, $V^*(x)$ is the unique fixed point of $\varphi(x,\cdot)$, with a well-defined derivative $\nabla V^*(x)$ ... Additionally, $\nabla_v \varphi\left(x,V^*(x)\right)$ coincides with the $\gamma$-scaled transi...
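The contraction property can be checked numerically on a toy example. The sketch below assumes that $\varphi(x,\cdot)$ is the tabular soft Bellman operator with temperature $\tau$ (an assumption; the excerpt does not define $\varphi$), with the reward held fixed. The function names and the random problem are hypothetical. It verifies the $\gamma$-contraction in the $\infty$-norm and compares the Jacobian in $v$, computed by finite differences, with $\gamma$ times the transition matrix induced by the softmax policy.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, tau = 5, 3, 0.9, 0.5               # arbitrary toy constants
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)               # P[s, a, :] is a distribution
r = rng.normal(size=(S, A))                     # fixed reward table (plays the role of x)

def phi(v):
    """Assumed soft Bellman operator in v for the fixed reward r."""
    q = r + gamma * (P @ v)
    m = q.max(axis=1)
    return m + tau * np.log(np.exp((q - m[:, None]) / tau).sum(axis=1))

def jacobian_analytic(v):
    """gamma times the transition matrix of the softmax policy at v."""
    q = r + gamma * (P @ v)
    w = np.exp((q - q.max(axis=1, keepdims=True)) / tau)
    pi = w / w.sum(axis=1, keepdims=True)
    return gamma * np.einsum("sa,sat->st", pi, P)

def jacobian_fd(v, eps=1e-6):
    """Central finite-difference Jacobian of phi with respect to v."""
    J = np.zeros((S, S))
    for j in range(S):
        e = np.zeros(S)
        e[j] = eps
        J[:, j] = (phi(v + e) - phi(v - e)) / (2 * eps)
    return J

v1, v2 = rng.normal(size=S), rng.normal(size=S)

# Contraction: ||phi(v1) - phi(v2)||_inf <= gamma * ||v1 - v2||_inf.
lhs = np.max(np.abs(phi(v1) - phi(v2)))
rhs = gamma * np.max(np.abs(v1 - v2))

# Jacobian rows are nonnegative and sum to gamma, so the induced inf-norm is gamma.
J = jacobian_analytic(v1)
jac_inf_norm = np.linalg.norm(J, ord=np.inf)
fd_gap = np.max(np.abs(J - jacobian_fd(v1)))
```

Because every row of the Jacobian is a probability vector scaled by $\gamma$, the spectral radius is at most $\gamma<1$, which is why $I-\nabla_v\varphi$ is invertible in this setting.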

Figures (4)

  • Figure 1: Comparison of algorithms on the Atari game, BeamRider, evaluated by the ground-truth reward. Each bilevel algorithm collects a total of $3000$ trajectory pairs. The running average over $15$ consecutive episodes is adopted for the presentation, and results are averaged over $5$ seeds.
  • Figure 2: Comparison on the Atari games Seaquest and SpaceInvaders, evaluated by the ground-truth reward. Each bilevel algorithm collects a total of $3000$ trajectory pairs. The results are averaged over $5$ seeds.
  • Figure 3: Comparison of algorithms on the MuJoCo simulations HalfCheetah, Walker2d, and Hopper, evaluated by the ground-truth reward. Each bilevel algorithm collects a total of $3000$ trajectory pairs. The results are averaged over $5$ seeds.
  • Figure 4: A synthetic bilevel RL problem to verify the model-based algorithm, M-SoBiRL. Metrics are the hyper-gradient norm $\left\| {\nabla \phi(x)} \right\| _2$ and the upper-level loss $f(x,\pi)$.

Theorems & Definitions (83)

  • Proposition 4.1
  • Proposition 5.1
  • Proposition 5.2: Hyper-gradient
  • Theorem 7.4: Model-based
  • Proposition 7.5
  • Theorem 7.6: Model-free
  • Theorem 7.7: Stochastic
  • Proposition C.1
  • proof
  • Definition C.2
  • ...and 73 more