Distributions as Actions: A Unified Framework for Diverse Action Spaces

Jiamin He, A. Rupam Mahmood, Martha White

TL;DR

This paper introduces a reinforcement learning framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. It also introduces a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space.

Abstract

We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions Actor-Critic (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.
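
The shifted agent-environment boundary described above can be illustrated with a small wrapper. This is a minimal sketch under stated assumptions, not the authors' implementation: the class names `KArmedBandit` and `DistributionsAsActionsEnv`, the `step` signature, and the reward values are all hypothetical, and only the discrete (categorical) case is shown. The agent's action is a probability vector `u`, and sampling the original action happens inside the environment.

```python
import random


class KArmedBandit:
    """Toy environment with a discrete action space (hypothetical example)."""

    def __init__(self, rewards):
        self.rewards = list(rewards)

    def step(self, action):
        # Deterministic reward for the chosen arm.
        return self.rewards[action]


class DistributionsAsActionsEnv:
    """Hypothetical wrapper: the agent acts with distribution parameters u.

    The sampling function f(.|u) lives inside the environment, so the action
    space seen by the agent is the (continuous) probability simplex.
    """

    def __init__(self, env, seed=0):
        self.env = env
        self.rng = random.Random(seed)

    def step(self, u):
        if min(u) < 0 or abs(sum(u) - 1.0) > 1e-8:
            raise ValueError("u must be a probability vector")
        # Sample the original action A ~ Categorical(u) on the environment side.
        action = self.rng.choices(range(len(u)), weights=u, k=1)[0]
        reward = self.env.step(action)
        # The sampled action is returned as optional extra information.
        return reward, action


env = DistributionsAsActionsEnv(KArmedBandit([0.0, 1.0, 0.5]))
reward, action = env.step([0.0, 1.0, 0.0])  # all probability mass on arm 1
```

From the agent's side the interface is continuous even though the underlying bandit is discrete, which is the property the abstract attributes to the reparameterization.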

Paper Structure

This paper contains 73 sections, 13 theorems, 44 equations, 16 figures, 13 tables, 7 algorithms.

Key Result

Proposition 3.0

Under the assumption labeled assm:nice_set in the paper, $\bar{v}_{\bar{\pi}}(s)=v_\pi(s)$ and $\bar{q}_{\bar{\pi}}(s,u)=\mathbb{E}_{A\sim f(\cdot\mid u)}\bigl[q_\pi(s,A)\bigr]$.
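
The second identity can be checked numerically in a stateless toy case. The sketch below assumes a K-armed bandit in which $u$ is a probability vector and $f(\cdot\mid u)$ is the categorical distribution with those probabilities; the function names and the numbers are illustrative choices, not taken from the paper. In this setting the expectation reduces to the dot product $\sum_k u_k\, q(k)$, which is compared against a Monte Carlo estimate.

```python
import random


def q_bar_exact(u, q_values):
    """Expected value of q under A ~ Categorical(u): sum_k u_k * q(k)."""
    return sum(p * q for p, q in zip(u, q_values))


def q_bar_monte_carlo(u, q_values, num_samples=100_000, seed=0):
    """Estimate the same expectation by sampling original actions from u."""
    rng = random.Random(seed)
    actions = rng.choices(range(len(u)), weights=u, k=num_samples)
    return sum(q_values[a] for a in actions) / num_samples


q_values = [1.0, 0.0, 0.5]  # illustrative action values q(k)
u = [0.2, 0.5, 0.3]         # distribution parameters chosen by the agent

exact = q_bar_exact(u, q_values)
estimate = q_bar_monte_carlo(u, q_values)
print(f"exact={exact:.4f}  monte_carlo={estimate:.4f}")
```

In this toy case the value over distribution parameters is linear in $u$, so its gradient with respect to $u$ is simply the vector of action values and involves no sampling of the original action.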

Figures (16)

  • Figure 1: Comparison between the classical reinforcement learning (RL) framework and the proposed distributions-as-actions framework. In the classical RL setting (col 1), the agent's policy $\pi_{\boldsymbol{\theta}}$ consists of $\bar{\pi}_{\boldsymbol{\theta}}$, which produces the distribution parameters, and a sampling function $f$ that returns an action given these parameters. In the distributions-as-actions framework (col 2), the sampling function $f$ is considered part of the environment, and the agent outputs the distribution parameters $\bar{\pi}_{\boldsymbol{\theta}}(S_t)$ as its action. The sampled action $A_t$ may optionally be observable to the agent, though it is not required for the core formulation. This shift redefines the interface between agent and environment, potentially simplifying learning and enabling new algorithmic perspectives.
  • Figure 2: Visualization of the reward function (col 1), expected rewards of distribution parameters (col 2), and learned critics using the standard update in \ref{eq:q_loss_or_gradient_transformed} (col 3) and the Interpolated Critic Learning (ICL) update in \ref{eq:q_loss_or_gradient_transformed_linear} (col 4) in policy evaluation (PE). Top: K-Armed Bandit. Bottom: Bimodal Continuous Bandit. With access only to samples from the PE policy (the fixed policy being evaluated), the standard update estimates values accurately at that policy but fails to generalize beyond it. In contrast, the ICL update learns a critic that captures curvature information useful for policy optimization.
  • Figure 3: Relative final performance of DA-AC versus TD3 across $20$ individual continuous control tasks (col 1), and average normalized returns of DA-AC and baselines on MuJoCo (col 2) and DeepMind Control (col 3) tasks. In individual task comparisons (col 1), results are averaged over $10$ seeds per task. For average performance plots (cols 2-3), values are averaged over $10$ seeds and tasks. Error bars show $95\%$ bootstrap confidence intervals (CIs).
  • Figure 4: Learning curves in six DeepMind Control tasks with high-dimensional action spaces. Results are averaged over $10$ seeds. Shaded regions show $95\%$ bootstrap CIs.
  • Figure 5: Average normalized returns of DA-AC and baselines on discrete control benchmarks, including classic control (col 1), MinAtar (col 2), discretized MuJoCo (col 3), and discretized DeepMind Control (col 4) tasks.
  • ...and 11 more figures

Theorems & Definitions (19)

  • Proposition 3.0
  • Theorem 4.1: Distributions-as-actions policy gradient theorem
  • Proposition 4.1
  • Proposition 4.1
  • Proposition 4.1
  • Proposition C.0
  • proof
  • Theorem C.1: Distributions-as-actions policy gradient theorem
  • proof
  • Proposition C.0
  • ...and 9 more