Subgoal Discovery Using a Free Energy Paradigm and State Aggregations

Amirhossein Mesbah, Reshad Hosseini, Seyed Pooya Shariatpanahi, Majid Nili Ahmadabadi

TL;DR

The paper tackles subgoal discovery in reinforcement learning to improve sample efficiency and reward shaping by introducing a free energy–based framework that selects between a Main state space and an Aggregation space. Subgoals (bottlenecks) are identified where the aggregation space becomes uncertain, as quantified by a free energy objective $F(s,m,\pi)$. The method uses Thompson sampling to approximate action-value distributions across spaces and applies Otsu thresholding with non-maximum suppression to extract bottlenecks, proving effective in both discrete grid-worlds and continuous settings with deep nets. This approach avoids explicit graph construction or predefined subgoal counts and demonstrates robustness to environment stochasticity, offering a scalable, model-free pathway to automatic subgoal discovery for HRL and GCRL.
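The Thompson-sampling step mentioned above can be illustrated with a minimal sketch. It assumes a Gaussian belief over each action value; the function name `thompson_policy` and its parameters are hypothetical and are not taken from the paper, whose exact estimator is not shown in this summary. Action values are sampled repeatedly, and the frequency with which each action comes out greedy is used as an approximate action distribution.

```python
import numpy as np


def thompson_policy(q_mean, q_std, n_samples=2000, seed=0):
    """Approximate P(action is greedy) under a Gaussian belief over Q-values.

    Assumption: independent Gaussian beliefs per action (illustrative only).
    """
    rng = np.random.default_rng(seed)
    q_mean = np.asarray(q_mean, dtype=float)
    q_std = np.asarray(q_std, dtype=float)
    # One row per Thompson sample, one column per action.
    samples = rng.normal(q_mean, q_std, size=(n_samples, q_mean.size))
    greedy = samples.argmax(axis=1)
    counts = np.bincount(greedy, minlength=q_mean.size)
    return counts / n_samples


# Hypothetical beliefs: one clearly better action vs. four indistinguishable ones.
probs_confident = thompson_policy([1.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
probs_uncertain = thompson_policy([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
```

With a narrow belief the sampled policy concentrates on one action, while a wide belief spreads it across actions, which is the kind of uncertainty signal a free energy comparison between spaces could draw on.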

Abstract

Reinforcement learning (RL) plays a major role in solving complex sequential decision-making tasks. Hierarchical and goal-conditioned RL are promising methods for dealing with two major problems in RL: sample inefficiency and the difficulty of reward shaping. These methods address both problems by decomposing a task into simpler subtasks and by temporally abstracting it in the action space. A key component of task decomposition in these methods is subgoal discovery: subgoal states can be used to define hierarchies of actions and to decompose complex tasks. Under the assumption that subgoal states are more unpredictable, we propose a free energy paradigm to discover them. This is achieved by using free energy to select between two spaces, the main space and an aggregation space. The number of model changes (switches of the selected space) from neighboring states to a given state indicates how unpredictable that state is, and is therefore used in this paper for subgoal discovery. Our empirical results on navigation tasks such as grid-world environments show that the proposed method can discover subgoals without prior knowledge of the task. The method is also robust to environment stochasticity.
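The space selection and model-change counting described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the free energy values are made-up numbers supplied as inputs, the names `select_space` and `count_model_changes` are hypothetical, and breaking ties in favor of the main space is an assumption.

```python
from collections import Counter

MAIN, AGG = "Main", "Agg"


def select_space(free_energy):
    """Pick the space with the lower free energy (ties go to Main: an assumption)."""
    return MAIN if free_energy[MAIN] <= free_energy[AGG] else AGG


def count_model_changes(trajectory, free_energy_by_state):
    """Count, per destination state, transitions where the selected space switches."""
    spaces = [select_space(free_energy_by_state[s]) for s in trajectory]
    changes = Counter()
    for prev_space, space, state in zip(spaces, spaces[1:], trajectory[1:]):
        if space != prev_space:
            changes[state] += 1
    return changes


# Hypothetical corridor of 7 states with a doorway at state 3, where the
# aggregation space is assumed to be uncertain (high free energy).
free_energy_by_state = {s: {MAIN: 1.0, AGG: 0.4} for s in range(7)}
free_energy_by_state[3] = {MAIN: 0.5, AGG: 2.0}

trajectory = [0, 1, 2, 3, 4, 5, 6]
model_changes = count_model_changes(trajectory, free_energy_by_state)
```

In this toy run the selected space switches when entering the doorway (Agg to Main) and again when leaving it (Main to Agg), so states 3 and 4 each receive one model change.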

Paper Structure

This paper contains 12 sections, 25 equations, 10 figures, 2 algorithms.

Figures (10)

  • Figure 1: This figure illustrates a two-room environment with a doorway connecting the rooms and a goal in the bottom right (orange). An agent starting from the top left can move in four directions. Each state in the main space corresponds to a $3 \times 3$ block of states in the aggregation space. The agent chooses between the main-space policy ($\pi(a|s,m_{Main})$) and the aggregation-space policy ($\pi(a|s,m_{Agg})$) based on their free energy (an uncertainty measure), selecting the space with the lower free energy. Near the bottleneck (doorway) states, the main space is preferred due to its lower free energy, while the aggregation space is chosen in states distant from doorways.
  • Figure 2: This figure demonstrates the agent's state space selection in a two-room environment. The blue circle represents the agent's current position, moving toward the goal (orange). Each state is assigned either the Main (M) or the Aggregation (Agg) space, whichever has the minimum free energy. Near and at the doorway, the agent switches from the Aggregation to the Main space due to higher uncertainty in aggregated states at bottlenecks, illustrating the model changes during navigation.
  • Figure 3: This figure tracks the incremental process of model changes from episodes 10 to 50 in a two-room environment, where the agent starts from the upper left corner and aims to reach a white goal state in the bottom right. The top row shows the number of model changes in transitioning from one state to another, while the bottom row shows identified bottlenecks considering the model changes. By episode 40, the agent accurately identifies the doorway as a bottleneck, and by episode 50, it also recognizes key states along the optimal path from the top-left start to the bottom-right goal.
  • Figure 4: This figure shows detected bottlenecks (in light blue) across six different environments at episode 50, using a state aggregation distance of $L=2$. In each environment, the agent starts from the upper left corner and aims to reach a white goal state in the bottom right. In the 1-room with hallway environment, bottlenecks appear along the hallway and near the goal due to early exploration patterns and the low number of model changes throughout this environment. While these hallway bottlenecks could be useful for defining "leaving hallway" options, increasing $L$ could prevent their detection if they are undesired.
  • Figure 5: This diagram illustrates the proposed method. The system evaluates free energy in both aggregation and main spaces using an approximate estimation of Thompson sampling and behavioral policy. The bottleneck discovery module tracks model changes between states, applies Otsu's thresholding, and uses non-maximum suppression to identify bottleneck states. The agent interacts with the environment, receiving rewards and next states while updating its state space model based on free energy evaluation.
  • ...and 5 more figures
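The bottleneck-extraction step named in the Figure 5 caption (Otsu's thresholding followed by non-maximum suppression over model-change counts) can be sketched with plain NumPy. This is an illustrative sketch under assumptions: the grid of counts is made up, the square neighborhood of radius 1 is an arbitrary choice, and the function names are hypothetical rather than the paper's.

```python
import numpy as np


def otsu_threshold(values, n_bins=64):
    """Return the bin edge that maximizes the between-class variance (Otsu's method)."""
    values = np.asarray(values, dtype=float).ravel()
    hist, edges = np.histogram(values, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    weight_low = np.cumsum(hist)                 # samples in bins 0..k
    weight_high = weight_low[-1] - weight_low    # samples in bins k+1..end
    cum_sum = np.cumsum(hist * centers)
    valid = (weight_low > 0) & (weight_high > 0)
    if not valid.any():                          # all values fall into one bin
        return values.max()
    mean_low = cum_sum[valid] / weight_low[valid]
    mean_high = (cum_sum[-1] - cum_sum[valid]) / weight_high[valid]
    between = weight_low[valid] * weight_high[valid] * (mean_low - mean_high) ** 2
    # Upper edge of the best split bin: values at or above it form the high class.
    return edges[1:][valid][np.argmax(between)]


def non_max_suppression(scores, mask, radius=1):
    """Keep masked cells that are maximal in their square neighborhood (ties are kept)."""
    keep = np.zeros(mask.shape, dtype=bool)
    for r, c in zip(*np.nonzero(mask)):
        window = scores[max(r - radius, 0): r + radius + 1,
                        max(c - radius, 0): c + radius + 1]
        keep[r, c] = scores[r, c] >= window.max()
    return keep


# Hypothetical model-change counts on a 5x7 grid with a doorway at (2, 3).
counts = np.zeros((5, 7))
counts[2, 3] = 9
counts[2, 2], counts[2, 4] = 6, 5
counts[0, 0], counts[4, 6] = 1, 1

threshold = otsu_threshold(counts)
candidates = counts >= threshold
bottlenecks = np.argwhere(non_max_suppression(counts, candidates)).tolist()
```

Here the threshold separates the three high-count cells around the doorway from the rest, and non-maximum suppression then keeps only the doorway cell itself. Because the threshold is derived from the data, no subgoal count has to be fixed in advance.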