Breaking the Grid: Distance-Guided Reinforcement Learning in Large Discrete and Hybrid Action Spaces

Heiko Hoppe, Fabian Akkerman, Wouter van Heeswijk, Maximilian Schiffer

TL;DR

DGRL combines Sampled Dynamic Neighborhoods (SDN) and Distance-Based Updates (DBU) to enable efficient RL in spaces with up to $10^{20}$ actions. It achieves performance improvements of up to 66% against state-of-the-art benchmarks across regularly and irregularly structured environments, while also improving convergence speed and computational complexity.

Abstract

Reinforcement Learning is increasingly applied to logistics, scheduling, and recommender systems, but standard algorithms struggle with the curse of dimensionality in such large discrete action spaces. Existing algorithms typically rely on restrictive grid-based structures or computationally expensive nearest-neighbor searches, limiting their effectiveness in high-dimensional or irregularly structured domains. We propose Distance-Guided Reinforcement Learning (DGRL), combining Sampled Dynamic Neighborhoods (SDN) and Distance-Based Updates (DBU) to enable efficient RL in spaces with up to $10^{20}$ actions. Unlike prior methods, SDN leverages a semantic embedding space to perform stochastic volumetric exploration, provably providing full support over a local trust region. Complementing this, DBU transforms policy optimization into a stable regression task, decoupling gradient variance from action space cardinality and guaranteeing monotonic policy improvement. DGRL naturally generalizes to hybrid continuous-discrete action spaces without requiring hierarchical dependencies. We demonstrate performance improvements of up to 66% against state-of-the-art benchmarks across regularly and irregularly structured environments, while simultaneously improving convergence speed and computational complexity.
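To make the two ideas in the abstract more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes an integer action grid with 10 values in each of 20 dimensions (so $10^{20}$ actions), a hand-written quadratic stand-in for the critic (`toy_q`), and a free parameter vector in place of an actor network. The function names, the Chebyshev radius, the sample count, and the step size are all hypothetical choices made for illustration. Candidates are drawn at random from a Chebyshev ball around the rounded proto-action (in the spirit of SDN), and the proto-action is then regressed toward the best candidate (in the spirit of DBU).

```python
import numpy as np


def sample_neighborhood(proto, radius, num_samples, low, high, rng):
    """Draw integer actions from a Chebyshev ball around a continuous proto-action.

    Assumption: every coordinate is perturbed independently, so samples cover the
    whole hypercube around the rounded proto-action, not only its axes.
    """
    center = np.clip(np.rint(proto), low, high).astype(int)
    offsets = rng.integers(-radius, radius + 1, size=(num_samples, proto.shape[0]))
    candidates = np.clip(center + offsets, low, high)
    # Row 0 is the rounded proto-action itself, so the best candidate is never worse.
    return np.vstack([center, candidates])


def toy_q(actions, optimum):
    """Hypothetical critic: negative squared distance to a hidden optimum."""
    return -np.sum((actions - optimum) ** 2, axis=-1)


def regression_step(proto, target, step_size):
    """One gradient step on 0.5 * ||proto - target||^2 with respect to proto."""
    return proto - step_size * (proto - target)


rng = np.random.default_rng(0)
dim, low, high = 20, 0, 9  # 10 values per dimension, 20 dimensions
optimum = rng.integers(low, high + 1, size=dim)
proto = np.full(dim, 4.5)  # stands in for the actor output

for _ in range(200):
    candidates = sample_neighborhood(proto, radius=1, num_samples=32,
                                     low=low, high=high, rng=rng)
    # Target action: best sampled candidate under the toy critic.
    target = candidates[np.argmax(toy_q(candidates, optimum))]
    # The update depends only on the distance to the target action,
    # not on the number of actions in the space.
    proto = regression_step(proto, target.astype(float), step_size=0.5)

final_action = np.clip(np.rint(proto), low, high).astype(int)
print("toy Q of final action:", toy_q(final_action, optimum))
```

In this sketch the cost per step scales with the number of sampled candidates and the action dimension, never with the $10^{20}$ total actions. A learned critic and actor network would replace `toy_q` and the free `proto` vector in any realistic setting.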

Paper Structure

This paper contains 46 sections, 14 theorems, 41 equations, 8 figures, 7 tables, 2 algorithms.

Key Result

Proposition 3.1

Let $Q(s, \cdot): \mathcal{A}' \to \mathbb{R}$ be $L_Q$-Lipschitz continuous w.r.t. a metric $d$. Let $a^\star$ be the optimal discrete action, $\hat{a}$ be a continuous proto-action, and $\bar{a}$ be a target action. The value loss of rounding $\hat{a}$ to its nearest neighbor $a_\mathrm{nn}$ is bo
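As a small numerical illustration of the Lipschitz assumption alone (not of the proposition's complete bound), the sketch below checks that rounding a continuous proto-action to its nearest grid point changes a Lipschitz Q-function by at most $L_Q \, d(\hat{a}, a_\mathrm{nn})$. The toy Q-function, the Euclidean metric, the integer grid, and all names are assumptions chosen for illustration.

```python
import numpy as np


def q_value(action, anchor):
    """Toy Q-function: negative Euclidean distance to an anchor (1-Lipschitz in L2)."""
    return -np.linalg.norm(action - anchor)


def rounding_gap(proto, anchor, lipschitz_constant=1.0):
    """Return (value change from rounding, Lipschitz bound on that change)."""
    nearest = np.rint(proto)  # nearest integer grid point under the L2 metric
    change = abs(q_value(proto, anchor) - q_value(nearest, anchor))
    bound = lipschitz_constant * np.linalg.norm(proto - nearest)
    return change, bound


rng = np.random.default_rng(0)
anchor = rng.uniform(0.0, 9.0, size=5)

# Largest amount by which the value change exceeds the bound (0.0 if it never does).
max_violation = 0.0
for _ in range(1000):
    proto = rng.uniform(0.0, 9.0, size=5)
    change, bound = rounding_gap(proto, anchor)
    max_violation = max(max_violation, change - bound)

print("largest violation of the Lipschitz bound:", max_violation)
```

The check passes for this toy function because of the reverse triangle inequality; for a learned critic, $L_Q$ would have to be assumed or estimated.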

Figures (8)

  • Figure 1: Schematic representation of SDN.
  • Figure 2: Schematic representation of DBU.
  • Figure 3: Results for discrete environments, averaged over 10 random seeds. Titles indicate size and type (structured or irregular) of action space. Legend indicates mapping method and RL algorithm.
  • Figure 4: Results for hybrid environments, averaged over 10 random seeds. Titles indicate size and type (structured or irregular) of action space. Legend indicates mapping method and RL algorithm.
  • Figure 5: Schematic overview of the full algorithm.
  • ...and 3 more figures

Theorems & Definitions (23)

  • Proposition 3.1: Approximation Bound via Lipschitz Continuity
  • Proposition 3.2: Dimensional Invariance of Chebyshev Neighborhoods
  • Proposition 5.1: Volumetric vs. Axial Support
  • Theorem 5.2: Removal of Action-Cardinality Dependence
  • Proposition 5.3: Trust Region Projection
  • Remark 5.4: Approximate Coordinate Ascent
  • Proposition 1.1: Approximation Bound via Lipschitz Continuity
  • proof
  • Proposition 1.2: Dimensional Invariance of Chebyshev Neighborhoods
  • proof
  • ...and 13 more