Tensor-Efficient High-Dimensional Q-learning

Junyi Wu, Dan Li

TL;DR

TEQL addresses learning in high-dimensional RL by combining a CP low-rank tensor representation of the Q-function with Error-Uncertainty Guided Exploration (EUGE) and a frequency-based penalty to improve sample efficiency. The method represents $\mathcal{Q}$ as $\mathcal{Q} \approx \sum_{r=1}^R \mathbf{f}_{1,r} \otimes \cdots \otimes \mathbf{f}_{N,r}$, achieving an effective parameter count $d_{\text{eff}} = RN$, and updates the factors via block coordinate descent; exploration incorporates both decomposition error and visitation counts. A theoretical regret bound $\mathbb{E}[\mathrm{Regret}_T] = \tilde{\mathcal{O}}(\sqrt{d_{\text{eff}} T})$ is established under structural assumptions, with the penalty term balancing exploration and exploitation. Empirically, TEQL outperforms the original tensor-based methods and deep RL baselines on Pendulum-v1 and CartPole-v1. Ablations confirm the benefits of the frequency penalty and EUGE, and sensitivity analyses highlight the trade-offs with discretization granularity.
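The CP representation above can be illustrated with a minimal sketch. It assumes a discretized state-action grid and uses hypothetical names (`init_factors`, `cp_q_value`, `dims`, `rank`) that do not come from the paper; it shows only how a single Q-value is read from the factors, not the authors' implementation or their block coordinate descent update. The count printed below is the number of stored factor entries in this sketch, which is not necessarily the paper's $d_{\text{eff}}$.

```python
import math
import random


def init_factors(dims, rank, seed=0):
    """Create one (dim_n x rank) factor matrix per tensor mode (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(d)]
        for d in dims
    ]


def cp_q_value(factors, index):
    """Evaluate Q[i_1, ..., i_N] = sum_r prod_n factors[n][i_n][r]."""
    rank = len(factors[0][0])
    return sum(
        math.prod(f[i][r] for f, i in zip(factors, index))
        for r in range(rank)
    )


# Assumed toy grid: two state dimensions with 10 bins each, one action dimension with 5 bins.
dims = (10, 10, 5)
rank = 3
factors = init_factors(dims, rank)

n_factor_entries = rank * sum(dims)   # 75 stored numbers in the factors
n_table_entries = math.prod(dims)     # 500 entries in the full Q-table
q_example = cp_q_value(factors, (2, 3, 1))
```

Even on this small grid the factors store far fewer numbers than the full table, and the gap widens as more dimensions are added, since the table grows multiplicatively while the factors grow additively.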

Abstract

High-dimensional reinforcement learning faces challenges with complex calculations and low sample efficiency in large state-action spaces. Q-learning algorithms struggle particularly with the curse of dimensionality, where the number of state-action pairs grows exponentially with problem size. While neural network-based approaches such as Deep Q-Networks have shown success, recent tensor-based methods using low-rank decomposition offer more parameter-efficient alternatives. Building upon existing tensor-based methods, we propose Tensor-Efficient Q-Learning (TEQL), which enhances low-rank tensor decomposition via improved block coordinate descent on discretized state-action spaces and incorporates novel exploration and regularization mechanisms. The key innovation is an exploration strategy that combines approximation error with a visit-count-based upper confidence bound to prioritize actions with high uncertainty, avoiding wasteful random exploration. Additionally, we incorporate a frequency-based penalty term in the objective function to encourage exploration of less-visited state-action pairs and reduce overfitting to frequently visited regions. Empirical results on classic control tasks demonstrate that TEQL outperforms conventional matrix-based methods and deep RL approaches in both sample efficiency and total rewards, making it suitable for resource-constrained applications, such as space and healthcare, where sampling costs are high.
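The exploration idea described in the abstract, combining approximation error with a visit-count-based upper confidence bound, might be sketched as follows. The additive form of the score, the weights `beta` and `c`, and the function name `euge_style_action` are all assumptions made for illustration; the abstract does not give the exact rule, so this should not be read as the paper's EUGE definition.

```python
import math


def euge_style_action(q_hat, approx_error, visit_counts, t, beta=1.0, c=1.0):
    """Return the index of the action with the largest optimistic score.

    q_hat, approx_error, visit_counts: per-action values for the current state.
    beta, c: hypothetical weights (assumptions, not values from the paper).
    """
    scores = []
    for q, err, n in zip(q_hat, approx_error, visit_counts):
        # UCB-style bonus: larger for actions that were visited less often.
        bonus = c * math.sqrt(math.log(t + 1.0) / (n + 1.0))
        # Assumed additive combination of estimate, approximation error, and bonus.
        scores.append(q + beta * err + bonus)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy numbers: action 1 has a slightly lower estimate but high error and few visits.
q_hat = [1.0, 0.9, 0.2]
approx_error = [0.0, 0.3, 0.1]
visit_counts = [50, 2, 40]

action = euge_style_action(q_hat, approx_error, visit_counts, t=100)
greedy = euge_style_action(q_hat, approx_error, visit_counts, t=100, beta=0.0, c=0.0)
```

With these toy numbers the optimistic score selects action 1, while the purely greedy choice (`beta=0`, `c=0`) selects action 0, which illustrates how uncertainty steers exploration toward rarely visited, poorly approximated actions instead of sampling at random.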

Paper Structure

This paper contains 19 sections, 2 theorems, 39 equations, 9 figures, 3 algorithms.

Key Result

Lemma 1

Suppose rewards are bounded $|r| \leq R_{\max}$ and the discount factor $\gamma \in (0,1)$. Under simplifying assumptions (e.g., approximate independence of state-action observations), for any time step $t$ and confidence parameter $\delta \in (0,1)$, if action $a_t$ is chosen via EUGE, then heurist where $a_t^* = \arg\max_a Q^*(s_t, a)$ is the optimal action, $Q_{\text{error}, t}(s_t, a_t) = |\ha

Figures (9)

  • Figure 1: Framework of the Tensor-Efficient $Q$-Learning (TEQL) Algorithm.
  • Figure 2: Comparison of TEQL (with frequency penalty and EUGE) against Original TLR in Pendulum (top) and CartPole (bottom). TEQL achieves faster convergence and higher sample efficiency.
  • Figure 3: Sample efficiency analysis showing episodes required to reach performance thresholds. TEQL consistently shows faster empirical convergence than TLR across all environments and performance targets. Box plots show median (center line), 25th-75th percentiles (box), and outliers (circles).
  • Figure 4: Ablation study comparing TEQL with and without the frequency penalty in Pendulum (top) and CartPole (bottom). The penalty enhances initial learning speed and convergence stability.
  • Figure 5: Sample efficiency analysis for ablation study comparing TEQL with and without frequency penalty. Box plots show episodes required to reach 80%, 90%, and 95% performance thresholds in Pendulum (top) and CartPole (bottom) environments. TEQL with penalty consistently achieves faster convergence across all performance targets, with compressed distributions indicating more reliable learning behavior.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Lemma 1: Heuristic EUGE Optimality Gap
  • Proposition 1: Expected Regret of TEQL in the Infinite-Horizon Discounted Setting
  • proof
  • proof
  • proof