Tensor-Efficient High-Dimensional Q-learning
Junyi Wu, Dan Li
TL;DR
TEQL addresses learning in high-dimensional RL by combining a CP low-rank tensor representation of the Q-function with Error-Uncertainty Guided Exploration (EUGE) and a frequency-based penalty to improve sample efficiency. The method represents $\mathcal{Q}$ as $\mathcal{Q} \approx \sum_{r=1}^R \mathbf{f}_{1,r} \otimes \cdots \otimes \mathbf{f}_{N,r}$, achieving an effective parameter count $d_{\text{eff}} = RN$ and updating via block coordinate descent; exploration incorporates both decomposition error and visitation counts. A theoretical regret bound $\mathbb{E}[\mathrm{Regret}_T] = \tilde{\mathcal{O}}(\sqrt{d_{\text{eff}} T})$ is established under structural assumptions, with the penalty term balancing exploration and exploitation. Empirically, TEQL outperforms original tensor-based methods and deep RL baselines on Pendulum-v1 and CartPole-v1, with ablations confirming the benefits of the frequency penalty and EUGE and sensitivity analyses highlighting the trade-offs with discretization granularity.
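To make the CP representation above concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes each mode $n$ of the discretized state-action tensor has a factor matrix of shape $(d_n, R)$ whose $r$-th column is $\mathbf{f}_{n,r}$. The names `cp_q_value` and `cp_full_tensor`, the mode sizes, and the rank are hypothetical and chosen only for illustration.

```python
import numpy as np

def cp_q_value(factors, index):
    """Evaluate Q at one discretized state-action index from CP factors.

    factors: list of N arrays; factors[n] has shape (d_n, R)  (assumed layout)
    index:   tuple of N integers, one per tensor mode
    """
    rank = factors[0].shape[1]
    prod = np.ones(rank)
    for f, i in zip(factors, index):
        prod *= f[i]              # row i of the mode-n factor, shape (R,)
    return float(prod.sum())      # sum over the R rank-one components

def cp_full_tensor(factors):
    """Materialize the full tensor; only feasible for small problems."""
    rank = factors[0].shape[1]
    shape = tuple(f.shape[0] for f in factors)
    full = np.zeros(shape)
    for r in range(rank):
        comp = factors[0][:, r]
        for f in factors[1:]:
            comp = np.multiply.outer(comp, f[:, r])   # outer product across modes
        full += comp
    return full

# Hypothetical toy sizes: three modes and rank 2.
rng = np.random.default_rng(0)
dims, R = (5, 4, 3), 2
factors = [rng.standard_normal((d, R)) for d in dims]

Q = cp_full_tensor(factors)
n_cp_entries = R * sum(dims)        # entries stored by the factors in this sketch
n_full_entries = int(np.prod(dims)) # entries in the dense Q tensor
```

In this sketch a single Q-value is read from the factors in $O(NR)$ time without forming the dense tensor, and the factors store `R * sum(dims)` entries where the dense tensor stores `prod(dims)`. A block coordinate descent step would update one factor matrix at a time while holding the others fixed.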
Abstract
High-dimensional reinforcement learning suffers from high computational cost and low sample efficiency in large state-action spaces. Q-learning algorithms are particularly affected by the curse of dimensionality: the number of state-action pairs grows exponentially with the number of state and action dimensions. While neural network-based approaches such as Deep Q-Networks have shown success, recent tensor-based methods using low-rank decomposition offer more parameter-efficient alternatives. Building upon existing tensor-based methods, we propose Tensor-Efficient Q-Learning (TEQL), which enhances low-rank tensor decomposition via improved block coordinate descent on discretized state-action spaces and incorporates novel exploration and regularization mechanisms. The key innovation is an exploration strategy that combines the approximation error with a visit-count-based upper confidence bound to prioritize actions with high uncertainty, avoiding wasteful random exploration. Additionally, we add a frequency-based penalty term to the objective function to encourage exploration of less-visited state-action pairs and to reduce overfitting to frequently visited regions. Empirical results on classic control tasks demonstrate that TEQL outperforms conventional matrix-based methods and deep RL approaches in both sample efficiency and total reward, making it suitable for resource-constrained applications, such as space and healthcare, where sampling costs are high.
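The exploration idea described in the abstract can be illustrated with a short sketch. The exact scoring rule is not stated in this section, so the additive form below, the weights `c_err` and `c_ucb`, and the function names are all assumptions made for illustration only: each action's score is its estimated Q-value plus a term proportional to a local approximation-error estimate plus a standard count-based UCB bonus.

```python
import numpy as np

def exploration_scores(q_values, approx_error, visit_counts, t,
                       c_err=1.0, c_ucb=1.0):
    """Score each action in the current state (hypothetical form).

    q_values:     estimated Q(s, a) for every action a
    approx_error: per-action estimate of the decomposition error
    visit_counts: number of times each (s, a) pair has been visited
    t:            current time step (>= 0)
    """
    q_values = np.asarray(q_values, dtype=float)
    approx_error = np.asarray(approx_error, dtype=float)
    visit_counts = np.asarray(visit_counts, dtype=float)
    # Count-based bonus: larger for rarely visited actions; +1 avoids division by zero.
    ucb_bonus = np.sqrt(np.log(t + 1.0) / (visit_counts + 1.0))
    return q_values + c_err * approx_error + c_ucb * ucb_bonus

def select_action(q_values, approx_error, visit_counts, t, **kwargs):
    """Pick the action with the highest combined score."""
    scores = exploration_scores(q_values, approx_error, visit_counts, t, **kwargs)
    return int(np.argmax(scores))

# Toy example: equal Q-values and errors, so the least-visited action wins.
action_by_count = select_action([0.5, 0.5, 0.5], [0.1, 0.1, 0.1], [10, 0, 5], t=100)

# Toy example: equal Q-values and counts, so the highest-error action wins.
action_by_error = select_action([0.5, 0.5, 0.5], [0.0, 0.2, 0.9], [3, 3, 3], t=100)
```

Under these assumptions, selection is deterministic given the statistics, so exploration is directed toward uncertain actions rather than spread uniformly as in epsilon-greedy.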
