Global Reinforcement Learning: Beyond Linear and Convex Rewards via Submodular Semi-gradient Methods

Riccardo De Santi; Manish Prajapat; Andreas Krause

TL;DR

This work addresses the limitations of additive state-based rewards in reinforcement learning by introducing Global Reinforcement Learning (GRL), where rewards are defined over entire trajectories via a global set function $F:2^{\mathcal{S}\times\mathcal{T}}\to\mathbb{R}$. It develops a meta-algorithm that linearizes $F$ with tight modular lower bounds and solves a sequence of standard MDPs (Global Trajectory Optimization, GTO, and Global Policy Optimization, GPO), yielding curvature-based approximation guarantees tied to submodular, supermodular, and BP structure. The authors prove hardness results for GRL and demonstrate effectiveness on tasks such as D-optimal design, diverse-synergy trajectory selection, and safe state coverage on grid worlds, highlighting improved exploration, design quality, and safety trade-offs. Overall, GRL provides a principled way to model and optimize non-additive, interaction-rich objectives in finite-horizon decision-making, with practical impact for experimental design, exploration, imitation learning, and risk-aware planning.

Abstract

In classic Reinforcement Learning (RL), the agent maximizes an additive objective of the visited states, e.g., a value function. Unfortunately, objectives of this type cannot model many real-world applications such as experiment design, exploration, imitation learning, and risk-averse RL, to name a few. This is because additive objectives disregard interactions between states that are crucial for certain tasks. To tackle this problem, we introduce Global RL (GRL), where rewards are globally defined over trajectories instead of locally over states. Global rewards can capture negative interactions among states, e.g., in exploration, via submodularity; positive interactions, e.g., synergetic effects, via supermodularity; and mixed interactions via combinations of the two. By exploiting ideas from submodular optimization, we propose a novel algorithmic scheme that converts any GRL problem to a sequence of classic RL problems and solves it efficiently with curvature-dependent approximation guarantees. We also provide hardness of approximation results and empirically demonstrate the effectiveness of our method on several GRL instances.
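To make the idea of a tight modular lower bound concrete, the following is a minimal sketch of the standard permutation-based construction from submodular optimization, not the authors' implementation. It assumes a submodular set function with $F(\emptyset)=0$; the toy coverage function, the element names, and the helper `modular_lower_bound` are all hypothetical and chosen only for illustration. In GRL the ground set would consist of (state, time) pairs rather than abstract labels.

```python
from itertools import combinations


def coverage(selected, covers):
    """Toy submodular set function: number of distinct items covered."""
    covered = set()
    for e in selected:
        covered |= covers[e]
    return len(covered)


def modular_lower_bound(F, ground, X):
    """Return weights w with sum(w[e] for e in Y) <= F(Y) for every Y,
    and equality at Y = X. Assumes F is submodular and F([]) == 0."""
    # Order the ground set so that the elements of X come first.
    order = [e for e in ground if e in X] + [e for e in ground if e not in X]
    weights, prefix, prev = {}, [], 0
    for e in order:
        prefix.append(e)
        cur = F(prefix)
        weights[e] = cur - prev  # marginal gain along the ordering
        prev = cur
    return weights


# Hypothetical instance: four elements, each covering two items.
covers = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}, "d": {1, 4}}
ground = list(covers)
F = lambda Y: coverage(Y, covers)
X = {"a", "c"}
w = modular_lower_bound(F, ground, X)

# Brute-force check of both properties on this small ground set.
subsets = [set(Y) for k in range(len(ground) + 1) for Y in combinations(ground, k)]
is_lower_bound = all(sum(w[e] for e in Y) <= F(Y) for Y in subsets)
is_tight = sum(w[e] for e in X) == F(X)
```

Because the weights are additive over elements, they can play the role of an ordinary per-step reward, which is what allows a global objective to be handed to a classic RL solver.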

Paper Structure (27 sections, 12 theorems, 64 equations, 12 figures, 2 tables, 2 algorithms)

Introduction
Preliminaries
Global Reinforcement Learning (GRL)
GRL as a Subset Selection Problem
Relation with Convex RL
Fundamental Limitation of Convex RL
Exploiting Structure in Global RL
Semi-gradient Method for GRL
Approximation guarantees and Hardness
How good is a modular approximation?
Hardness of Global RL
Experiments
Experimental Insights and Observations
Related Work
Conclusion
...and 12 more sections

Key Result

proposition 1

Given an instance $\mathcal{I^+}$ of ST-CRL, it is possible to reduce it to an instance $\mathcal{I_+}$ of GRL (\ref{eq:global_reinforcement_learning}).

Figures (12)

Figure 1: The agent has visited trajectory $\tau_t$ and must select the next state. On the left, the agent aims to estimate an unknown state function $f$: re-visiting $s_2$ leads to a negative interaction since the information gain has diminishing returns. On the right, the agent seeks a trajectory, i.e., an ordered set of atoms, maximizing synergies seen as positive interactions among certain combinations of atoms, e.g., adding $s_3$ to $\tau_t=\{s_1,s_2\}$ leads to a synergetic effect.
Figure 2: In GTO, at any step $t$, we construct $m_{\tau_t}$, a modular lower bound that is tight at $\tau_t$, and optimize the resulting classic MDP, which results in an improved trajectory $\tau_{t+1}$.
Figure 3: We compare GTO and GPO with the optimal policy for the modularized objective $F_m$ (MOD). We observe that MOD performs sub-optimally as its objective cannot capture interactions between states. The alternative versions of the algorithm tested are presented in Section \ref{sec:alternative_LB}. (Y-axis: $\mathcal{J}(\pi)$, X-axis: iterations)
Figure 4: State Coverage: (left) values of $F(\tau)$ in the deterministic GMDP setting, where $\tau$ is the trajectory computed by GTO at each iteration (x-axis), which matches the optimal non-Markovian policy. (right) values of $\mathcal{J}(\pi)$ in the stochastic GMDP setting, where $\pi$ is the policy computed by GPO at each iteration (x-axis).
Figure 5: State Coverage, $H=31, 35$ iterations: (left) trajectory $\tau_1$ induced by the output policy of GTO using GTO-S lower bounds achieves $F(\tau_1) = 56$, (right) trajectory $\tau_2$ induced by the output policy of GTO using GTO-greedy-S lower bounds achieves $F(\tau_2) = 64$. GTO-greedy-S outperforms GTO-S in those instances where the horizon is just enough to reach optimality.
...and 7 more figures
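
The loop described in the Figure 2 caption can be sketched on a toy deterministic problem. This is a hedged illustration rather than the paper's GTO: the four-state path graph, the state-coverage reward `distinct_states`, the ordering of the remaining (state, time) pairs, and the helper names are all assumptions made here. Each iteration builds a modular lower bound that is tight at the current trajectory, solves the resulting additive-reward problem exactly by backward dynamic programming, and keeps the new trajectory only if the global reward improves.

```python
def distinct_states(pairs):
    """Global reward F: number of distinct states among (state, time) pairs."""
    return len({s for s, _ in pairs})


def lower_bound_weights(F, ground, traj):
    """Weights of a modular lower bound of submodular F, tight at traj."""
    in_traj = set(traj)
    order = list(traj) + [e for e in ground if e not in in_traj]
    weights, prefix, prev = {}, [], 0
    for e in order:
        prefix.append(e)
        cur = F(prefix)
        weights[e] = cur - prev  # marginal gain along the chosen ordering
        prev = cur
    return weights


def best_trajectory(neighbors, start, horizon, weights):
    """Maximize the additive reward weights[(s, t)] by backward DP."""
    value = {s: weights[(s, horizon - 1)] for s in neighbors}
    policy = []
    for t in range(horizon - 2, -1, -1):
        step = {s: max(neighbors[s], key=lambda n: value[n]) for s in neighbors}
        value = {s: weights[(s, t)] + value[step[s]] for s in neighbors}
        policy.append(step)
    policy.reverse()  # policy[t][s] = state to occupy at time t + 1
    traj, s = [(start, 0)], start
    for t in range(horizon - 1):
        s = policy[t][s]
        traj.append((s, t + 1))
    return traj


def gto_sketch(neighbors, start, horizon, max_iters=20):
    """Iterate: linearize F at the current trajectory, then re-plan."""
    # Ordering of the ground set is an arbitrary choice of this sketch.
    ground = [(s, t) for t in reversed(range(horizon)) for s in neighbors]
    traj = [(start, t) for t in range(horizon)]  # assumes a self-loop at start
    history = [distinct_states(traj)]
    for _ in range(max_iters):
        weights = lower_bound_weights(distinct_states, ground, traj)
        new_traj = best_trajectory(neighbors, start, horizon, weights)
        if distinct_states(new_traj) <= history[-1]:
            break  # no strict improvement: stop at this (possibly local) optimum
        traj = new_traj
        history.append(distinct_states(traj))
    return traj, history


# Hypothetical path graph 0-1-2-3 with self-loops; moves are deterministic.
neighbors = {0: [1, 0], 1: [2, 0, 1], 2: [3, 1, 2], 3: [2, 3]}
traj, history = gto_sketch(neighbors, start=0, horizon=4)
```

Since the current trajectory is always feasible for the linearized problem and the bound is tight there, the global reward cannot decrease across iterations. Which trajectory the loop reaches does depend on the ordering used for the lower bound and on tie-breaking in the dynamic program, so a different ordering can stall at a worse trajectory.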

Theorems & Definitions (30)

definition 1: Global Markov Decision Process
proposition 1: Single Trial Convex RL $\subseteq$ Global RL
definition 2: Submodular rewards
definition 3: Supermodular rewards
definition 4: BP rewards
definition 5: Non-additive Suboptimality Gap
definition 6: Submodular and Supermodular Curvature
theorem 7.0: Approximation Guarantees
theorem 7.0: Hardness of GRL, trajectory-optimization \ref{eq:global_reinforcement_learning_traj}
definition 7: Time-extended CMP
...and 20 more
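
For context on definition 6, the curvature notions commonly used in the submodular optimization literature are recalled below; the paper's own definition may differ in notation or in the domain over which the minimum is taken. Assume a monotone submodular $f$ and a monotone supermodular $g$ on a ground set $V$, with $f(\emptyset)=g(\emptyset)=0$, positive denominators, and the marginal-gain shorthand $h(v \mid A) = h(A \cup \{v\}) - h(A)$:

$$\kappa_f = 1 - \min_{v \in V} \frac{f(v \mid V \setminus \{v\})}{f(\{v\})}, \qquad \kappa^g = 1 - \min_{v \in V} \frac{g(\{v\})}{g(v \mid V \setminus \{v\})}.$$

Both quantities lie in $[0,1]$, and $\kappa_f = 0$ (respectively $\kappa^g = 0$) exactly when the function is modular, which is the regime where a modular approximation loses nothing.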
