Submodular Reinforcement Learning

Manish Prajapat, Mojmír Mutný, Melanie N. Zeilinger, Andreas Krause

TL;DR

The paper studies reinforcement learning with non-additive, submodular rewards defined over trajectories, formalized as a submodular MDP (SMDP). It proves that the general problem is hard to approximate: SubRL cannot be approximated within any constant factor. It then introduces SubPO, a policy-gradient method that greedily maximizes marginal gains. Under restricted settings such as $\epsilon$-Bandit SMDP or bounded curvature, SubPO achieves constant-factor approximations (e.g., $1-1/e$ or $1-c$). Experiments cover biodiversity monitoring, Bayesian experimental design, informative path planning, and robotics-like simulations, and the reported results show sample-efficient learning that scales to high-dimensional state-action spaces.

Abstract

In reinforcement learning (RL), rewards of states are typically considered additive, and following the Markov assumption, they are $\textit{independent}$ of states visited previously. In many important applications, such as coverage control, experiment design and informative path planning, rewards naturally have diminishing returns, i.e., their value decreases in light of similar states visited previously. To tackle this, we propose $\textit{submodular RL}$ (SubRL), a paradigm which seeks to optimize more general, non-additive (and history-dependent) rewards modelled via submodular set functions which capture diminishing returns. Unfortunately, in general, even in tabular settings, we show that the resulting optimization problem is hard to approximate. On the other hand, motivated by the success of greedy algorithms in classical submodular optimization, we propose SubPO, a simple policy gradient-based algorithm for SubRL that handles non-additive rewards by greedily maximizing marginal gains. Indeed, under some assumptions on the underlying Markov Decision Process (MDP), SubPO recovers optimal constant factor approximations of submodular bandits. Moreover, we derive a natural policy gradient approach for locally optimizing SubRL instances even in large state- and action- spaces. We showcase the versatility of our approach by applying SubPO to several applications, such as biodiversity monitoring, Bayesian experiment design, informative path planning, and coverage maximization. Our results demonstrate sample efficiency, as well as scalability to high-dimensional state-action spaces.
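
To make the notion of diminishing returns concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy coverage reward in which each state senses a set of cells; the `SENSED` table, the state names, and all function names are hypothetical. It illustrates what a marginal gain is, and that the marginal gains collected along a trajectory sum to the trajectory's total value.

```python
# Illustrative sketch only: a toy coverage reward, not the paper's code.
# SENSED is a hypothetical map from a state to the set of cells it observes.
SENSED = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {1, 2},
    "D": {5},
}

def coverage(states):
    """Value F(tau): number of distinct cells sensed along the trajectory."""
    covered = set()
    for s in states:
        covered |= SENSED[s]
    return len(covered)

def marginal_gain(history, state):
    """F(state | history) = F(history + [state]) - F(history)."""
    return coverage(list(history) + [state]) - coverage(history)

def marginal_gains(trajectory):
    """Per-step marginal gains; by telescoping they sum to F(trajectory)."""
    return [marginal_gain(trajectory[:t], s) for t, s in enumerate(trajectory)]

trajectory = ["A", "B", "C", "D"]
gains = marginal_gains(trajectory)
print(gains)                              # [3, 1, 0, 1]
print(sum(gains), coverage(trajectory))   # 5 5

# Diminishing returns: "C" is worth less once "A" has been visited.
print(marginal_gain([], "C"), marginal_gain(["A"], "C"))  # 2 0
```

In this toy instance, state `C` contributes two new cells from an empty history but nothing after `A`, which is the history dependence that an additive reward cannot express.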

Paper Structure

This paper contains 19 sections, 14 theorems, 24 equations, 8 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

For any deterministic MDP with a fixed initial state, the optimal Markovian policy achieves the same value as the optimal non-Markovian policy.

Figures (8)

  • Figure 1: In the biodiversity-monitoring panel (gorilla environment), a drone needs to plan a path with maximum coverage over critical areas, represented by lighter regions in a heatmap. Here, the additional information (coverage) provided by visiting a location (state) depends on which states have been visited before. We therefore must visit diverse locations that maximize coverage of important regions. In the Steiner-covering panel, the environment contains groups of items ($g_i$) placed on a grid. The agent must find a trajectory ($\tau$) that picks a fixed number of items $d_i$ from each group $g_i$, i.e., $\max_{\tau} \sum_i \min(|\tau\cap g_i|,d_i)$. If the agent picks more than $d_i$ items from a group, the extra items are not rewarded (diminishing gain). Neither of these tasks can be represented with additive rewards (in terms of locations), and both serve as illustrative examples for this work.
  • Figure 2: Submodular Policy Optimization (SubPO)
  • Figure 3: Comparison of SubPO-M, SubPO-NM and ModPO. We observe that ModPO gets stuck by repeatedly maximizing its modular reward, whereas SubPO-M achieves comparable performance to SubPO-NM while being more sample efficient. (Y-axis: normalized $J(\pi)$, X-axis: epochs)
  • Figure 4: Challenging tasks modelled via submodular reward functions. Primarily, the agent at location $s$ senses a region $D^{s}$ and seeks a policy that maximizes the submodular reward $F(\tau) = |\cup_{s\in \tau}D^{s}|$. a) The agent, starting from the middle, must learn to explore both rooms. b) The car must learn to drive and finish the racing lap. c) The ant must learn to walk so as to cover the maximum 2D space around itself.
  • Figure 5: a) Coverage in building exploration: SubPO-NM tracks history and can explore the other room. b) The car trained with SubPO-M learns to drive through the track (Y-axis: normalized, 0 = start and 1 = finish). c) The ant trained with SubPO-M learns to explore the domain (Y-axis: normalized by the domain area). Both b) and c) show that SubPO scales well to high-dimensional continuous domains.
  • ...and 3 more figures
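
The Steiner-covering objective in the Figure 1 caption, $\sum_i \min(|\tau\cap g_i|,d_i)$, can be evaluated directly. The sketch below is only an illustration under stated assumptions: the groups, demands, and trajectories are made up, grid cells are represented as integers, and the function names are hypothetical. It also includes a brute-force check of the diminishing-returns property on this small instance.

```python
# Illustrative sketch only; the groups, demands, and trajectories are made up.
from itertools import combinations

def steiner_value(trajectory, groups, demands):
    """Objective from the Figure 1 caption: sum_i min(|tau ∩ g_i|, d_i)."""
    visited = set(trajectory)  # repeated visits to a cell do not count twice
    return sum(min(len(visited & g), d) for g, d in zip(groups, demands))

def is_submodular(ground, f):
    """Brute-force check: f(A+x) - f(A) >= f(B+x) - f(B) for all A ⊆ B, x not in B."""
    items = sorted(ground)
    subsets = [set(c) for r in range(len(items) + 1) for c in combinations(items, r)]
    for b in subsets:
        for a in subsets:
            if not a <= b:
                continue
            for x in ground - b:
                if f(a | {x}) - f(a) < f(b | {x}) - f(b):
                    return False
    return True

# Hypothetical instance: two groups of items with demands d_1 = 2 and d_2 = 1.
groups = [{0, 1, 2}, {3, 4}]
demands = [2, 1]

print(steiner_value([0, 3], groups, demands))           # 2
print(steiner_value([0, 1, 3], groups, demands))        # 3
print(steiner_value([0, 1, 2, 3, 4], groups, demands))  # 3, extra items add nothing

ground = set().union(*groups)
print(is_submodular(ground, lambda s: steiner_value(s, groups, demands)))  # True
```

Picking all five items yields the same value as picking three well-chosen ones, which is the diminishing gain the caption describes. The exhaustive check enumerates every pair of subsets, so it is only practical for very small ground sets.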

Theorems & Definitions (24)

  • Proposition 1
  • Proposition 1
  • Theorem 1
  • Theorem 1
  • Definition 1: DR submodularity and DR-property (Bian et al., 2017)
  • Definition 2: $\epsilon$-Bandit SMDP
  • Theorem 1
  • Proposition 1
  • Proposition 1
  • proof
  • ...and 14 more