Convex Is Back: Solving Belief MDPs With Convexity-Informed Deep Reinforcement Learning

Daniel Koutas, Daniel Hettegger, Kostas G. Papakonstantinou, Daniel Straub

TL;DR

This paper tackles belief-space deep reinforcement learning (DRL) for POMDPs by exploiting the convexity of the optimal value function over beliefs, proposing hard- and soft-enforced convexity within a Dueling Q-network framework. The authors show that convexity constraints can improve learning speed, robustness to hyperparameters, and extrapolation to out-of-distribution (OOD) observations, with gradient-based soft enforcement often performing best. Experiments on the Tiger and FieldVisionRockSample (FVRS) problems show notable gains in OOD settings and robust performance across problem variants, suggesting that well-behaved value-function extrapolation is beneficial in partially observable domains. The work offers a practical way to enhance DRL in belief-based settings and points to future work on high-dimensional belief spaces and actor-critic architectures.

Abstract

We present a novel method for Deep Reinforcement Learning (DRL), incorporating the convex property of the value function over the belief space in Partially Observable Markov Decision Processes (POMDPs). We introduce hard- and soft-enforced convexity as two different approaches, and compare their performance against standard DRL on two well-known POMDP environments, namely the Tiger and FieldVisionRockSample problems. Our findings show that including the convexity feature can substantially increase performance of the agents, as well as increase robustness over the hyperparameter space, especially when testing on out-of-distribution domains. The source code for this work can be found at https://github.com/Dakout/Convex_DRL.
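To make the convexity property concrete, the following minimal sketch checks the first-order condition $V(b') \geq V(b) + \nabla V(b)^\top (b' - b)$ on sampled beliefs and turns violations into a hinge penalty. This is only one plausible form of a gradient-based soft constraint, not the loss used in the paper; the function names, the alpha vectors, and the sampling scheme are all hypothetical choices made for illustration.

```python
import numpy as np


def alpha_vector_value(beliefs, alphas):
    """V(b) = max_i <alpha_i, b>; a pointwise maximum of linear functions is convex."""
    return np.max(beliefs @ alphas.T, axis=1)


def alpha_vector_subgradient(beliefs, alphas):
    """A subgradient of V at each belief: the maximizing alpha vector."""
    best = np.argmax(beliefs @ alphas.T, axis=1)
    return alphas[best]


def convexity_penalty(value_fn, grad_fn, b, b_other):
    """Mean hinge violation of V(b') >= V(b) + grad V(b)^T (b' - b)."""
    tangent = value_fn(b) + np.sum(grad_fn(b) * (b_other - b), axis=1)
    return float(np.mean(np.maximum(0.0, tangent - value_fn(b_other))))


# Illustrative alpha vectors for a two-state problem (hypothetical values).
ALPHAS = np.array([[-100.0, 10.0], [10.0, -100.0], [-1.0, -1.0]])

rng = np.random.default_rng(0)
b = rng.dirichlet(np.ones(2), size=256)        # beliefs on the probability simplex
b_other = rng.dirichlet(np.ones(2), size=256)  # comparison beliefs

# A convex value function incurs (numerically) zero penalty.
convex_penalty = convexity_penalty(
    lambda x: alpha_vector_value(x, ALPHAS),
    lambda x: alpha_vector_subgradient(x, ALPHAS),
    b, b_other,
)

# A concave function violates the condition and is penalized.
concave_penalty = convexity_penalty(
    lambda x: -np.sum(x**2, axis=1),
    lambda x: -2.0 * x,
    b, b_other,
)
```

In a training loop, a term of this kind would typically be added to the usual temporal-difference loss with a weighting coefficient, with the gradient obtained by automatic differentiation of the network rather than from alpha vectors.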

Paper Structure

This paper contains 30 sections, 28 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Dueling network architecture, with the belief input (black), dense layers (gray), value stream (cyan), advantage stream (green) and Q-value output (red). Arrows indicate dense weights and the brown lines indicate computation without weights; adapted from Wang et al. (2016).
  • Figure 2: Boxplots (color-coded) over all optimal agents of a hyperparameter search with 200 runs for each convexity method. An optimal agent is one that has reached the optimal policy within the given number of training steps. The number of optimal agents per method was grad: 178, hard: 68, hess: 69, None: 193, point: 183. The agents have been trained on $p_{obs}=1.0$ and cross-evaluated on $p_{obs}=\{0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}$ with $10^5$ MC samples. Each boxplot includes the median as a blue horizontal line, the interquartile range (IQR) as an opaque colored box, and the $1.5\cdot$IQR distances from the respective quartiles as whiskers; the maximum achieved value is marked with a colored hollow circle, and other outliers are omitted to avoid clutter.
  • Figure 3: Best agents (color-coded) evaluated for 10 runs for each convexity method. The agents have been trained on the default observation function and are cross-evaluated on the heaviside (heavi) and a set of $p_{obs}=\{0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}$ constant observation functions with $10^4$ MC samples. The figure shows the respective reward means (solid horizontal line) as well as $\pm$ 1 standard deviation (transparent bars).
  • Figure 4: Best agents (color-coded) evaluated for 10 runs for each convexity method. The agents have been trained on the heaviside observation function and are cross-evaluated on the default (def) and a set of $p_{obs}=\{0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}$ constant observation functions with $10^4$ MC samples. The figure shows the respective reward means (solid horizontal line) as well as $\pm$ 1 standard deviation (transparent bars).
  • Figure A.1: Tiger value function plot over the belief space for 6 example agents trained without (a) and with gradient-based convexity enforcement (b) on $p_{obs}=1.0$.
  • ...and 3 more figures
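
The value and advantage streams described for Figure 1 are combined into Q-values by a weight-free computation. A minimal sketch of that step follows, assuming the standard mean-subtracted aggregation from the dueling architecture of Wang et al. (2016); the paper's adapted variant may differ, and the function name `dueling_q_values` is hypothetical.

```python
import numpy as np


def dueling_q_values(value, advantages):
    """Combine stream outputs as Q(b, a) = V(b) + A(b, a) - mean_a A(b, a).

    value:      array of shape (batch, 1), output of the value stream
    advantages: array of shape (batch, n_actions), output of the advantage stream
    """
    return value + advantages - advantages.mean(axis=1, keepdims=True)
```

Subtracting the mean advantage makes the decomposition identifiable: the Q-values of each belief average to the value-stream output, so any constraint placed on the value stream acts on the mean Q-value directly.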