Analysis of approximate linear programming solution to Markov decision problem with log barrier function

Donghwan Lee, Hyukjun Yang, Bum Geun Park

TL;DR

The paper develops a theoretical foundation for solving LP-based MDPs by embedding the inequality constraints into a log-barrier term, which converts the LP into an unconstrained optimization problem. It proves that the barrier-augmented solution $\tilde{Q}_\eta$ approaches the true optimum $Q^*$ as the barrier weight $\eta$ shrinks, with an error that scales linearly in $\eta$, and it characterizes both primal and dual approximate policies. The authors establish convexity, local strong convexity, and exponential convergence of gradient descent in the tabular setting. They then extend the framework to deep RL via a log-barrier loss that performs competitively with DQN and improves on DDPG in several tasks. The approach offers a principled, gradient-based alternative to primal–dual LP methods and provides insights for stable, barrier-informed learning in offline and constrained RL contexts.
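
To see why an error linear in $\eta$ is plausible, the standard log-barrier argument for a generic LP can be written out. This is only a sketch under assumed notation ($c$, $a_i$, $b_i$, $m$, $x_\eta$, $\lambda$ are generic symbols, not the paper's), and it bounds the objective gap rather than reproducing the paper's bound on $\|Q^* - \tilde{Q}_\eta\|_\infty$.

```latex
% Generic LP with m inequality constraints (assumed notation, not the paper's).
\begin{align*}
p^\star &= \min_{x}\; c^\top x \quad \text{s.t.}\quad a_i^\top x \ge b_i,\quad i = 1,\dots,m, \\
x_\eta  &= \arg\min_{x}\; c^\top x - \eta \sum_{i=1}^{m} \log\!\left(a_i^\top x - b_i\right).
\intertext{Setting the gradient to zero at $x_\eta$ and defining
           $\lambda_i = \eta / (a_i^\top x_\eta - b_i) > 0$ gives}
c &= \sum_{i=1}^{m} \lambda_i a_i
   \quad\Longrightarrow\quad \lambda \text{ is dual feasible, hence } b^\top \lambda \le p^\star, \\
c^\top x_\eta - b^\top \lambda
  &= \sum_{i=1}^{m} \lambda_i \left(a_i^\top x_\eta - b_i\right) = m\,\eta
   \quad\Longrightarrow\quad 0 \le c^\top x_\eta - p^\star \le m\,\eta .
\end{align*}
```

In words: the barrier minimizer is strictly feasible, and its objective value exceeds the LP optimum by at most the number of constraints times the barrier weight.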

Abstract

There are two primary approaches to solving Markov decision problems (MDPs): dynamic programming based on the Bellman equation and linear programming (LP). Dynamic programming methods are the most widely used and form the foundation of both classical and modern reinforcement learning (RL). By contrast, LP-based methods have been less commonly employed, although they have recently gained attention in contexts such as offline RL. The relative underuse of LP-based methods stems from the fact that they lead to an inequality-constrained optimization problem, which is generally more challenging to solve effectively compared with Bellman-equation-based methods. The purpose of this paper is to establish a theoretical foundation for solving LP-based MDPs in a more effective and practical manner. Our key idea is to leverage the log-barrier function, widely used in inequality-constrained optimization, to transform the LP formulation of the MDP into an unconstrained optimization problem. This reformulation enables approximate solutions to be obtained easily via gradient descent. While the method may appear simple, to the best of our knowledge, a thorough theoretical interpretation of this approach has not yet been developed. This paper aims to bridge this gap.
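
The idea in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the paper's algorithm: it assumes the classical state-value LP (minimize $\rho^\top V$ subject to $V(s) \ge R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')$) instead of the paper's Q-function formulation, and the two-state MDP, the weights `rho`, the backtracking step rule, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def slack(V, P, R, gamma):
    # slack[s, a] = V(s) - R(s, a) - gamma * sum_s' P(s'|s, a) V(s')
    return V[:, None] - R - gamma * (P @ V)

def barrier_objective(V, P, R, gamma, rho, eta):
    s = slack(V, P, R, gamma)
    if np.any(s <= 0):          # outside the feasible region
        return np.inf
    return rho @ V - eta * np.log(s).sum()

def barrier_gradient(V, P, R, gamma, rho, eta):
    w = eta / slack(V, P, R, gamma)
    # d slack[s, a] / d V[j] = 1[s == j] - gamma * P[s, a, j]
    return rho - w.sum(axis=1) + gamma * np.einsum("sa,saj->j", w, P)

def solve_barrier(P, R, gamma, rho, eta, iters=5000):
    # Constant start with slack >= 1 everywhere, so it is strictly feasible.
    V = np.full(P.shape[0], (R.max() + 1.0) / (1.0 - gamma))
    f = barrier_objective(V, P, R, gamma, rho, eta)
    step = 1.0
    for _ in range(iters):
        g = barrier_gradient(V, P, R, gamma, rho, eta)
        step = min(2.0 * step, 1.0)
        while True:             # backtracking (Armijo) line search
            V_new = V - step * g
            f_new = barrier_objective(V_new, P, R, gamma, rho, eta)
            if f_new <= f - 1e-4 * step * (g @ g):
                break
            step *= 0.5
            if step < 1e-16:    # no further progress at machine precision
                return V
        V, f = V_new, f_new
    return V

def value_iteration(P, R, gamma, iters=2000):
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (R + gamma * (P @ V)).max(axis=1)
    return V

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] and R[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 0.0], [0.0, 1.0]])
gamma = 0.9
rho = np.full(2, 0.5)           # assumed positive state weights

V_star = value_iteration(P, R, gamma)
solutions = {eta: solve_barrier(P, R, gamma, rho, eta) for eta in (1.0, 0.1, 0.01)}
errors = {eta: np.max(np.abs(V - V_star)) for eta, V in solutions.items()}
for eta, err in errors.items():
    print(f"eta={eta}: max-norm error {err:.4f}")
```

Because every iterate stays strictly feasible, each barrier solution lies above the optimal value function, and the printed max-norm error should shrink as `eta` decreases. The backtracking rule is just one convenient way to keep gradient steps inside the feasible region; a fixed, sufficiently small learning rate would be another.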

Paper Structure

This paper contains 46 sections, 22 theorems, 191 equations, 20 figures, 4 tables, and 4 algorithms.

Key Result

Lemma 1

The dual problem of the LP (eq:primal-LP1) is given by

Figures (20)

  • Figure 1: Learning curves comparing the Log-barrier DQN and standard DQN on the Gymnasium control environments. Each curve represents the average return over 10 random seeds, with the shaded area indicating one standard deviation from the mean.
  • Figure 2: Learning curves comparing the Log-barrier DDPG and standard DDPG on the MuJoCo continuous control environments. Each curve represents the average return over 8 random seeds, with the shaded area indicating one standard deviation from the mean.
  • Figure 3: Effect of the log-barrier coefficient $\eta$ on the convergence rate. The plot shows the max-norm error, $\|Q^* - \tilde{Q}_\eta\|_{\infty}$, between the learned and optimal Q-functions versus the number of training iterations. These results were obtained from experiments on a $6\times 6$ FrozenLake-v1 environment, where the ground-truth $Q^*$ was pre-computed using value iteration.
  • Figure 4: Error evolution with $\eta = 0.001$ for different learning rates. The plot shows the max-norm error, $\|Q^* - \tilde{Q}_\eta\|_{\infty}$, between the learned and optimal Q-functions versus the number of training iterations. These results were obtained from experiments on a $6\times 6$ FrozenLake-v1 environment, where the ground-truth $Q^*$ was pre-computed using value iteration.
  • Figure 5: Error evolution with $\eta = 0.005$ for different learning rates. The plot shows the max-norm error, $\|Q^* - \tilde{Q}_\eta\|_{\infty}$, between the learned and optimal Q-functions versus the number of training iterations. These results were obtained from experiments on a $6\times 6$ FrozenLake-v1 environment, where the ground-truth $Q^*$ was pre-computed using value iteration.
  • ...and 15 more figures

Theorems & Definitions (47)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Theorem 2
  • Lemma 3
  • proof
  • Theorem 3
  • proof
  • Corollary 1
  • proof
  • ...and 37 more