Global Reinforcement Learning: Beyond Linear and Convex Rewards via Submodular Semi-gradient Methods

Riccardo De Santi, Manish Prajapat, Andreas Krause

TL;DR

This work addresses the limitations of additive state-based rewards in reinforcement learning by introducing Global Reinforcement Learning (GRL), where rewards are defined over entire trajectories via a global set function $F:2^{\mathcal{S}\times\mathcal{T}}\to\mathbb{R}$. It develops a meta-algorithm, instantiated as Global Trajectory Optimization (GTO) and Global Policy Optimization (GPO), that linearizes $F$ with tight modular lower bounds and solves a sequence of standard MDPs, yielding curvature-based approximation guarantees tied to submodular, supermodular, and BP structure. The authors prove hardness results for GRL and demonstrate effectiveness on tasks such as D-optimal design, diverse-synergy trajectory selection, and safe state coverage on grid worlds, highlighting improved exploration, design quality, and safety trade-offs. Overall, GRL provides a principled way to model and optimize non-additive, interaction-rich objectives in finite-horizon decision-making, with practical impact for experimental design, exploration, imitation learning, and risk-aware planning.

Abstract

In classic Reinforcement Learning (RL), the agent maximizes an additive objective of the visited states, e.g., a value function. Unfortunately, objectives of this type cannot model many real-world applications such as experiment design, exploration, imitation learning, and risk-averse RL, to name a few. This is because additive objectives disregard interactions between states that are crucial for certain tasks. To tackle this problem, we introduce Global RL (GRL), where rewards are globally defined over trajectories instead of locally over states. Global rewards can capture negative interactions among states, e.g., in exploration, via submodularity; positive interactions, e.g., synergetic effects, via supermodularity; and mixed interactions via combinations of the two. By exploiting ideas from submodular optimization, we propose a novel algorithmic scheme that converts any GRL problem to a sequence of classic RL problems and solves it efficiently with curvature-dependent approximation guarantees. We also provide hardness of approximation results and empirically demonstrate the effectiveness of our method on several GRL instances.
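To make the contrast between additive and global rewards concrete, here is a minimal Python sketch. It assumes a simple coverage reward (the number of distinct states visited) as a stand-in for a submodular global reward; the function names are hypothetical, and the reward is defined over states only, whereas the paper's $F$ is defined over state-time pairs.

```python
from typing import Hashable, Sequence


def coverage(trajectory: Sequence[Hashable]) -> int:
    """Illustrative global reward: number of distinct states visited."""
    return len(set(trajectory))


def marginal_gain(trajectory: Sequence[Hashable], state: Hashable) -> int:
    """Gain in coverage from appending `state` to `trajectory`."""
    return coverage(list(trajectory) + [state]) - coverage(trajectory)


def additive_return(trajectory: Sequence[Hashable], reward_per_visit: int = 1) -> int:
    """Classic additive objective: every visit pays the same reward."""
    return reward_per_visit * len(trajectory)


tau = ["s1", "s2"]
gain_revisit = marginal_gain(tau, "s2")  # revisiting adds nothing to coverage
gain_new = marginal_gain(tau, "s3")      # a new state adds one unit
```

Under the additive objective a revisit of `s2` pays as much as a first visit to `s3`, which is exactly the interaction the global reward is meant to capture.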

Paper Structure (27 sections, 12 theorems, 64 equations, 12 figures, 2 tables, 2 algorithms)


Key Result

proposition 1

Given an instance $\mathcal{I}^+$ of Single Trial Convex RL (ST-CRL), it is possible to reduce it to an instance $\mathcal{I}_+$ of GRL.

Figures (12)

  • Figure 1: The agent has visited trajectory $\tau_t$ and must select the next state. On the left, the agent aims to estimate an unknown state function $f$: re-visiting $s_2$ leads to a negative interaction since the information gain has diminishing returns. On the right, the agent seeks a trajectory, i.e., an ordered set of atoms, that maximizes synergies, seen as positive interactions among certain combinations of atoms, e.g., adding $s_3$ to $\tau_t=\{s_1,s_2\}$ leads to a synergetic effect.
  • Figure 2: In GTO, at each step $t$, we construct $m_{\tau_t}$, a modular lower bound that is tight at $\tau_t$, and optimize the resulting classic MDP, which yields an improved trajectory $\tau_{t+1}$.
  • Figure 3: We compare GTO and GPO with the optimal policy for the modularized objective $F_m$ (MOD). We observe that MOD performs sub-optimally as its objective cannot capture interactions between states. The alternative versions of the algorithm tested are presented in the section on alternative lower bounds (label sec:alternative_LB). (Y-axis: $\mathcal{J}(\pi)$, X-axis: iterations)
  • Figure 4: State Coverage: (left) values of $F(\tau)$ in the deterministic GMDP setting, where $\tau$ is the trajectory computed by GTO at each iteration (x-axis), which matches the optimal non-Markovian policy. (right) values of $\mathcal{J}(\pi)$ in the stochastic GMDP setting, where $\pi$ is the policy computed by GPO at each iteration (x-axis).
  • Figure 5: State Coverage, $H=31, 35$ iterations: (left) trajectory $\tau_1$ induced by the output policy of GTO using GTO-S lower bounds achieves $F(\tau_1) = 56$, (right) trajectory $\tau_2$ induced by the output policy of GTO using GTO-greedy-S lower bounds achieves $F(\tau_2) = 64$. GTO-greedy-S outperforms GTO-S in instances where the horizon is just long enough to reach optimality.
  • ...and 7 more figures
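
The GTO step described in the Figure 2 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the environment (a four-state deterministic chain), the coverage reward `F`, and all function names are hypothetical. The modular lower bound is the standard permutation-based one for submodular functions, built from an ordering that lists the current trajectory first; the ordering of the remaining elements (here simply reverse-sorted) is an arbitrary choice.

```python
N_STATES, HORIZON, START = 4, 4, 0
ACTIONS = (-1, 0, 1)  # move left, stay, move right


def step(s, a):
    """Deterministic chain dynamics, clipped at both ends."""
    return min(max(s + a, 0), N_STATES - 1)


def F(pairs):
    """Toy submodular global reward: distinct states among (t, s) pairs."""
    return len({s for _, s in pairs})


def modular_lower_bound(current):
    """Per-element weights from a permutation that lists `current` first.

    For submodular F with F(empty) = 0, the sum of weights over any set
    lower-bounds F on that set and is tight on `current`.
    """
    ground = [(t, s) for t in range(HORIZON) for s in range(N_STATES)]
    in_current = set(current)
    rest = sorted((e for e in ground if e not in in_current), reverse=True)
    weights, prefix, prev = {}, set(), 0
    for e in list(current) + rest:
        prefix.add(e)
        val = F(prefix)
        weights[e] = val - prev  # marginal gain along the permutation
        prev = val
    return weights


def solve_additive(weights):
    """Backward dynamic programming on the time-extended MDP."""
    value = [[0] * N_STATES for _ in range(HORIZON)]
    policy = [[0] * N_STATES for _ in range(HORIZON - 1)]
    for s in range(N_STATES):
        value[-1][s] = weights[(HORIZON - 1, s)]
    for t in range(HORIZON - 2, -1, -1):
        for s in range(N_STATES):
            a_best = max(ACTIONS, key=lambda a: value[t + 1][step(s, a)])
            policy[t][s] = a_best
            value[t][s] = weights[(t, s)] + value[t + 1][step(s, a_best)]
    traj, s = [(0, START)], START
    for t in range(HORIZON - 1):
        s = step(s, policy[t][s])
        traj.append((t + 1, s))
    return traj


def gto_sketch(n_iters=5):
    """Alternate between linearizing F and solving the additive problem."""
    traj = [(t, START) for t in range(HORIZON)]  # start by staying put
    history = [F(traj)]
    for _ in range(n_iters):
        traj = solve_additive(modular_lower_bound(traj))
        history.append(F(traj))
    return traj, history


final_traj, history = gto_sketch()
```

Because the bound is tight at the current trajectory and the additive problem is solved exactly, `history` is non-decreasing. With this naive ordering the loop can stall at a local optimum, so the choice of permutation matters in practice.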

Theorems & Definitions (30)

  • definition 1: Global Markov Decision Process
  • proposition 1: Single Trial Convex RL $\subseteq$ Global RL
  • definition 2: Submodular rewards
  • definition 3: Supermodular rewards
  • definition 4: BP rewards
  • definition 5: Non-additive Suboptimality Gap
  • definition 6: Submodular and Supermodular Curvature
  • theorem 7.0: Approximation Guarantees
  • theorem 7.0: Hardness of GRL, trajectory optimization (eq:global_reinforcement_learning_traj)
  • definition 7: Time-extended CMP
  • ...and 20 more
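
For orientation, the reward classes and curvature notions named in the list above have the following textbook forms in the submodular optimization literature. This is a sketch over a generic finite ground set $V$ (in the paper's setting presumably $V = \mathcal{S}\times\mathcal{T}$); the paper's own definitions may differ in notation or in side conditions.

```latex
% Marginal gain of element v given set A, for F : 2^V \to \mathbb{R}
F(v \mid A) := F(A \cup \{v\}) - F(A)

% Submodular (diminishing returns): for all A \subseteq B \subseteq V and v \in V \setminus B
F(v \mid A) \ge F(v \mid B)

% Supermodular (increasing returns): for all A \subseteq B \subseteq V and v \in V \setminus B
F(v \mid A) \le F(v \mid B)

% BP: sum of a monotone submodular f and a monotone supermodular g
F(A) = f(A) + g(A)

% Submodular curvature of f and supermodular curvature of g
% (assuming the denominators are strictly positive)
\kappa_f = 1 - \min_{v \in V} \frac{f(v \mid V \setminus \{v\})}{f(v \mid \emptyset)},
\qquad
\kappa^g = 1 - \min_{v \in V} \frac{g(v \mid \emptyset)}{g(v \mid V \setminus \{v\})}
```

Both curvatures lie in $[0, 1]$ for monotone normalized functions, and a value of 0 corresponds to a modular (additive) function, which recovers the classic RL reward.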