Recursive Backwards Q-Learning in Deterministic Environments
Jan Diekhoff, Jörn Fischer
TL;DR
This work tackles the inefficiency of model-free Q-learning in deterministic, episodic tasks by introducing Recursive Backwards Q-Learning (RBQL), a model-based agent that builds an environment model during exploration and then propagates value information backwards from terminal states. With a learning rate of $\alpha=1$, the RBQL update reduces to $Q(S_t,A_t) = R_{t+1} + \gamma \max_a Q(S_{t+1},a)$, which spreads the terminal reward through the explored paths in breadth-first order. Empirical results on grid-world mazes of varying sizes show that RBQL reduces both the average number of steps and their variance compared to standard Q-learning, and that it reaches optimal policies faster, including on mazes as large as $50\times50$. The work demonstrates the practicality of model-based backward propagation for deterministic problems and points to future research on generalizing the approach to non-deterministic environments and more complex task structures.
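For reference, the expression above follows from the usual tabular Q-learning update, assuming the standard form with learning rate $\alpha$:

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t) \right]$$

Setting $\alpha=1$ cancels the old estimate $Q(S_t,A_t)$ on the right-hand side, leaving

$$Q(S_t,A_t) \leftarrow R_{t+1} + \gamma \max_a Q(S_{t+1},a)$$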
Abstract
Reinforcement learning is a popular method for finding optimal solutions to complex problems. Algorithms like Q-learning excel at learning to solve stochastic problems without a model of their environment. However, they take longer than necessary to solve deterministic problems. Q-learning can be improved to better solve deterministic problems by introducing a model-based approach. This paper introduces the recursive backwards Q-learning (RBQL) agent, which explores and builds a model of the environment. After reaching a terminal state, it recursively propagates that state's value backwards through this model. This lets each state be evaluated to its optimal value without a lengthy learning process. In the example of finding the shortest path through a maze, this agent greatly outperforms a regular Q-learning agent.
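The backward propagation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `backward_propagate`, the dictionary-based model layout, and the four-state corridor are all hypothetical, and the sketch uses an iterative breadth-first queue rather than explicit recursion. It also assumes $\alpha=1$ and a reward that is only non-zero on entering the terminal state, so that states closer to the goal are settled first.

```python
from collections import defaultdict, deque


def backward_propagate(model, terminal, gamma=0.9):
    """Propagate values backwards from `terminal` through a learned model.

    `model` maps (state, action) -> (next_state, reward); this layout is an
    assumption made for illustration only.
    """
    predecessors = defaultdict(list)  # next_state -> [(state, action, reward)]
    actions = defaultdict(list)       # state -> actions seen in that state
    for (s, a), (s_next, r) in model.items():
        predecessors[s_next].append((s, a, r))
        actions[s].append(a)

    q = {}
    visited = {terminal}
    queue = deque([terminal])
    while queue:
        s_next = queue.popleft()
        # Best known value of the successor; 0.0 for the terminal state
        # and for actions that have not been evaluated yet.
        best_next = max((q.get((s_next, a), 0.0) for a in actions[s_next]),
                        default=0.0)
        for s, a, r in predecessors[s_next]:
            # Update with alpha = 1: the old estimate is overwritten.
            q[(s, a)] = r + gamma * best_next
            if s not in visited:
                visited.add(s)
                queue.append(s)
    return q


# Hypothetical explored corridor: states 0..3, state 3 is terminal.
corridor = {
    (0, "right"): (1, 0.0),
    (1, "right"): (2, 0.0),
    (1, "left"): (0, 0.0),
    (2, "right"): (3, 1.0),
    (2, "left"): (1, 0.0),
}
q_values = backward_propagate(corridor, terminal=3)
```

In this corridor, a single backward pass assigns each explored state-action pair the terminal reward discounted once per additional step, so acting greedily on `q_values` moves right towards the goal from every state.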
