Recursive Backwards Q-Learning in Deterministic Environments

Jan Diekhoff, Jörn Fischer

TL;DR

This work tackles the inefficiency of model-free Q-learning in deterministic, episodic tasks by introducing Recursive Backwards Q-Learning (RBQL), a model-based agent that builds a model of the environment during exploration and, once a terminal state is reached, propagates value information backwards from that state through the model. With a learning rate of $\alpha=1$, the Q-learning update reduces to $Q(S_t,A_t) = R_{t+1} + \gamma \max_a Q(S_{t+1},a)$, and RBQL applies it in breadth-first order from the terminal state, spreading the terminal reward through the explored paths. Empirical results on grid-world mazes of varying sizes, up to $50\times50$, show that RBQL takes fewer steps on average, with lower variance, than standard Q-learning and reaches optimal policies sooner. The work demonstrates the practicality of model-based backward propagation for deterministic problems and points to future research on generalizing the approach to non-deterministic environments and more complex task structures.
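
The role of $\alpha=1$ follows directly from the standard Q-learning update. The short derivation below uses common textbook notation and is meant as a reading aid, not as a reproduction of the paper's numbered equations:

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\right]$$

Setting $\alpha=1$ cancels the old estimate $Q(S_t,A_t)$ on the right-hand side:

$$Q(S_t,A_t) \leftarrow R_{t+1} + \gamma \max_a Q(S_{t+1},a)$$

In a deterministic environment $R_{t+1}$ and $S_{t+1}$ are fixed for a given state-action pair, so a single update of this form is exact as soon as the successor's value is exact. Updating outwards from the terminal state makes that the case for the transitions the model has recorded.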

Abstract

Reinforcement learning is a popular method of finding optimal solutions to complex problems. Algorithms like Q-learning excel at learning to solve stochastic problems without a model of their environment. However, they take longer to solve deterministic problems than is necessary. Q-learning can be improved to better solve deterministic problems by introducing a model-based approach. This paper introduces the recursive backwards Q-learning (RBQL) agent, which explores and builds a model of the environment. After reaching a terminal state, it recursively propagates its value backwards through this model. This lets each state be evaluated to its optimal value without a lengthy learning process. In the example of finding the shortest path through a maze, this agent greatly outperforms a regular Q-learning agent.
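
The backward propagation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `backward_propagate`, the dictionary-based `model`, and the four-cell corridor are hypothetical choices made for illustration. The sketch assumes a deterministic model with $\alpha=1$, a terminal state of value 0, and a uniform reward for every non-terminal step, so that the first breadth-first visit of a state already yields its best value.

```python
from collections import defaultdict, deque


def backward_propagate(model, terminal, gamma=0.9):
    """Propagate values backwards from `terminal` through a learned model.

    model: dict mapping (state, action) -> (next_state, reward), i.e. the
           deterministic transitions recorded during exploration.
    Returns a dict mapping (state, action) -> Q-value.
    """
    # Index the model by successor so we can walk it backwards.
    predecessors = defaultdict(list)
    for (state, action), (next_state, reward) in model.items():
        predecessors[next_state].append((state, action, reward))

    q = {}
    value = {terminal: 0.0}  # assumption: no reward after the terminal state
    queue = deque([terminal])

    while queue:
        next_state = queue.popleft()
        for state, action, reward in predecessors[next_state]:
            # Q-learning update with alpha = 1.
            q[(state, action)] = reward + gamma * value[next_state]
            if state not in value:
                # First visit in breadth-first order: under the uniform
                # step-reward assumption this is the state's best value.
                value[state] = q[(state, action)]
                queue.append(state)
    return q


# Hypothetical corridor 0 - 1 - 2 - 3, where state 3 is terminal.
corridor = {
    (0, "R"): (1, -1.0),
    (1, "L"): (0, -1.0),
    (1, "R"): (2, -1.0),
    (2, "L"): (1, -1.0),
    (2, "R"): (3, 10.0),
}

for key, q_value in sorted(backward_propagate(corridor, terminal=3).items()):
    print(key, round(q_value, 3))
```

A single pass assigns a value to every recorded transition that leads, directly or indirectly, to the terminal state. With non-uniform step rewards the first-visit shortcut no longer holds, and a state's value would have to be recomputed as the maximum over its actions.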

Paper Structure

This paper contains 11 sections, 5 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: Basic agent-environment relationship in a Markov decision process. The agent chooses an action $A_t$ and the environment returns a new state $S_{t+1}$ and a reward $R_{t+1}$. The dotted line represents the transition from step $t$ to step $t+1$ (Sutton and Barto).
  • Figure 2: Q-learning in a one-dimensional grid world. All Q-values are initialized as $-1$. Actions that lead to the terminal state reward 10. All other actions reward -1. The discount rate $\gamma$ is set to $0.9$. The learning rate $\alpha$ is set to $0.5$. The value of $\epsilon$ is irrelevant as the only action the agent takes is $\rightarrow$.
  • Figure 3: Number of steps taken to find the goal in a randomly generated grid world maze of size $5\times5$. The blue line is the minimum step threshold for any maze of this size. The light red area shows the range of the Q-learning agent's highest and lowest step count, excluding the highest and lowest two. The red line shows the average performance. Similarly, the light green area shows the range of the RBQL agent's highest and lowest step count, excluding the highest and lowest two, and the green line shows the average performance.
  • Figure 4: Number of steps taken to find the goal in a randomly generated grid world maze of size $10\times10$. The light red area shows the range of the Q-learning agent's highest and lowest step count, excluding the highest and lowest two. The red line shows the average performance. Similarly, the light green area shows the range of the RBQL agent's highest and lowest step count, excluding the highest and lowest two, and the green line shows the average performance.
  • Figure 5: Number of steps taken to find the goal in a randomly generated grid world maze of size $15\times15$. The light red area shows the range of the Q-learning agent's highest and lowest step count, excluding the highest and lowest two. The red line shows the average performance. Similarly, the light green area shows the range of the RBQL agent's highest and lowest step count, excluding the highest and lowest two, and the green line shows the average performance.
  • ...and 1 more figures
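
The setting in the Figure 2 caption (Q-values initialized to $-1$, reward 10 for reaching the terminal state and $-1$ otherwise, $\gamma=0.9$, $\alpha=0.5$) can be tried out with a few lines of Python. This is a hedged sketch rather than a reproduction of the figure: the corridor length of four cells is an assumption, as is the simplification that each cell has only the $\rightarrow$ action, so the maximum over actions is just the next cell's Q-value. The function name `run_episode` is hypothetical.

```python
def run_episode(q, alpha=0.5, gamma=0.9, goal_reward=10.0, step_reward=-1.0):
    """Walk right through a 1-D corridor once, updating Q-values in place.

    q[i] is the Q-value of moving right from cell i. Moving right from the
    last cell enters the terminal state, whose value is taken to be 0.
    """
    n = len(q)
    for cell in range(n):
        reaches_goal = cell + 1 == n
        reward = goal_reward if reaches_goal else step_reward
        # Only one action per cell is modeled, so max_a Q(s', a) = q[cell + 1].
        next_value = 0.0 if reaches_goal else q[cell + 1]
        q[cell] += alpha * (reward + gamma * next_value - q[cell])
    return q


q_values = [-1.0] * 4  # assumed corridor length; all Q-values start at -1
for episode in range(1, 4):
    run_episode(q_values)
    print(episode, [round(v, 3) for v in q_values])
```

After the first episode only the last cell reflects the goal reward (its Q-value is 4.5 under these settings). Each later episode carries that information one cell further back towards the start, which is the slow, step-by-step spreading that backward propagation through a model is meant to avoid.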