Joint MDPs and Reinforcement Learning in Coupled-Dynamics Environments

Ege C. Kaya, Mahsa Ghasemi, Abolfazl Hashemi

Abstract

Many distributional quantities in reinforcement learning are intrinsically joint across actions, including distributions of gaps and probabilities of superiority. However, the classical Markov decision process (MDP) formalism specifies only marginal laws and leaves the joint law of counterfactual one-step outcomes across multiple possible actions at a state unspecified. We study coupled-dynamics environments with a multi-action generative interface which can sample counterfactual one-step outcomes for multiple actions under shared exogenous randomness. We propose joint MDPs (JMDPs) as a formalism for such environments by augmenting an MDP with a multi-action sample transition model which specifies a coupling of one-step counterfactual outcomes, while preserving standard MDP interaction as marginal observations. We adopt and formalize a one-step coupling regime where dependence across actions is confined to immediate counterfactual outcomes at the queried state. In this regime, we derive Bellman operators for $n$th-order return moments, providing dynamic programming and incremental algorithms with convergence guarantees.
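To illustrate why gap distributions and probabilities of superiority depend on the joint law and not only on the marginals, here is a minimal sketch of a two-action generative interface. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `sample_joint` and `gap_stats`, the Gaussian rewards, and the means 1.0 and 0.5. Both settings give each action the same marginal reward law; they differ only in whether the two actions share the exogenous noise.

```python
import random
import statistics


def sample_joint(rng, coupled):
    """Hypothetical two-action interface: one draw of (r0, r1) at a single state.

    Marginally r0 ~ N(1.0, 1) and r1 ~ N(0.5, 1) in both modes (illustrative values).
    """
    shared = rng.gauss(0.0, 1.0)
    if coupled:
        # Both actions see the same exogenous randomness.
        noise0, noise1 = shared, shared
    else:
        # Each action gets its own independent noise draw.
        noise0, noise1 = shared, rng.gauss(0.0, 1.0)
    return 1.0 + noise0, 0.5 + noise1


def gap_stats(coupled, n=20000, seed=0):
    """Monte Carlo estimates of gap mean, gap variance, and P(r0 > r1)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n):
        r0, r1 = sample_joint(rng, coupled)
        gaps.append(r0 - r1)
    mean = statistics.fmean(gaps)
    var = statistics.pvariance(gaps)
    p_superior = sum(g > 0 for g in gaps) / n
    return mean, var, p_superior


coupled_stats = gap_stats(coupled=True)
independent_stats = gap_stats(coupled=False)
```

Under shared noise the gap is essentially constant, so its variance vanishes and action 0 is superior on every draw. Under independent noise the gap variance is close to 2 and the probability of superiority is well below 1, even though the marginal reward laws are unchanged.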

Paper Structure (16 sections, 10 theorems, 76 equations, 7 figures)

This paper contains 16 sections, 10 theorems, 76 equations, 7 figures.

Key Result

Lemma 5.2

Let $\lambda := 2/(1-\gamma)$. For any moment collection $M$, define the norm $\left\lVert M\right\rVert_{\lambda}$. Then, $T^\pi_2$ is a $\gamma$-contraction in $\left\lVert \,\cdot\,\right\rVert_{\lambda}$.
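As a short worked consequence, the contraction property alone implies geometric decay of the Bellman residual along the iterates $M_{k+1} := T^\pi_2 M_k$. The iteration scheme and the starting point $M_0$ are assumptions of this sketch; only the $\gamma$-contraction property comes from the lemma.

$$
\left\lVert M_{k+1} - T^\pi_2 M_{k+1}\right\rVert_{\lambda}
= \left\lVert T^\pi_2 M_{k} - T^\pi_2 M_{k+1}\right\rVert_{\lambda}
\le \gamma \left\lVert M_{k} - T^\pi_2 M_{k}\right\rVert_{\lambda},
\qquad\text{hence}\qquad
\left\lVert M_{k} - T^\pi_2 M_{k}\right\rVert_{\lambda}
\le \gamma^{k} \left\lVert M_{0} - T^\pi_2 M_{0}\right\rVert_{\lambda}.
$$

On a logarithmic scale this bound is a straight line with slope $\log \gamma$, which is the quantity plotted against $k$ in Figure 2.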

Figures (7)

  • Figure 1: A $3\times3$ WGW environment and the evaluated policy.
  • Figure 2: Bellman residual convergence in tabular coupled-dynamics environments. Left: $5\times 5$ WGW. Right: CRC with $\lvert \mathcal{S}\rvert=25$. We plot the Bellman residual $\lVert M_k - T_2^\pi M_k\rVert_\lambda$ on a logarithmic scale.
  • Figure 3: Per-state action correlation matrices in WGW. Each tile corresponds to a state in a $3\times 3$ WGW, and displays the $4\times 4$ correlation matrix $\rho^\pi_s$ computed through JIPE-$2$.
  • Figure 4: CRC with $M$ states, where the two actions share the same transition dynamics while rewards are anti-correlated at each state.
  • Figure 5: Gap validation in WGW. We evaluate a fixed goal-directed policy and consider gaps between the policy’s action and alternatives. Left: Predicted vs. MC gap means $\mathbb{E}[G^\pi]$. Middle: Predicted vs. MC gap variances $\mathrm{var}(G^\pi)$. Right: Empirical CDF of the ratio $\hat{\mathbb{P}}(G^\pi\le 0)/\mathrm{Chebyshev}(G^\pi)$, which measures the empirical tightness of the Chebyshev upper bound. The blue curve computes the denominator using JIPE-$2$-estimated moments, while the orange curve uses MC-estimated moments (a near-ground-truth proxy). This comparison separates two sources of looseness: (i) intrinsic looseness of the Chebyshev bound itself, and (ii) moment-estimation error. The close agreement between the two curves indicates that most of the observed looseness is due to the bound itself rather than to inaccurate moment estimation.
  • ...and 2 more figures

Theorems & Definitions (27)

  • Definition 3.1: MDP
  • Definition 3.2: Sample transition model (STM) [DRL-textbook]
  • Definition 4.1: $m$-JSTM
  • Definition 4.2: Coupled-dynamics environment
  • Example 1
  • Example 2
  • Definition 4.3: JMDP
  • Remark 4.4
  • Definition 4.6: Joint return vector
  • Definition 5.1: $2$nd-order joint Bellman operator
  • ...and 17 more