Free Energy Projective Simulation (FEPS): Active inference with interpretability

Joséphine Pazem, Marius Krumm, Alexander Q. Vining, Lukas J. Fiderer, Hans J. Briegel

TL;DR

The results show that FEPS agents fully resolve the ambiguity of both environments by appropriately contextualizing their observations based on prediction accuracy only and infer optimal policies flexibly for any target observation in the environment.

Abstract

In the last decade, the free energy principle (FEP) and active inference (AIF) have achieved many successes connecting conceptual models of learning and cognition to mathematical models of perception and action. This effort is driven by a multidisciplinary interest in understanding aspects of self-organizing complex adaptive systems, including elements of agency. Various reinforcement learning (RL) models performing active inference have been proposed and trained on standard RL tasks using deep neural networks. Recent work has focused on improving such agents' performance in complex environments by incorporating the latest machine learning techniques. In this paper, we take an alternative approach. Within the constraints imposed by the FEP and AIF, we attempt to model agents in an interpretable way without deep neural networks by introducing Free Energy Projective Simulation (FEPS). Using internal rewards only, FEPS agents build a representation of the partially observable environments with which they interact. Following AIF, the policy to achieve a given task is derived from this world model by minimizing the expected free energy. Leveraging the interpretability of the model, techniques are introduced to deal with long-term goals and reduce prediction errors caused by erroneous hidden state estimation. We test the FEPS model on two RL environments inspired by behavioral biology: a timed response task and a navigation task in a partially observable grid. Our results show that FEPS agents fully resolve the ambiguity of both environments by appropriately contextualizing their observations based on prediction accuracy only. In addition, they infer optimal policies flexibly for any target observation in the environment.

Paper Structure

This paper contains 23 sections, 31 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Architecture and training of an FEPS agent. a) Architecture of an FEPS agent, with four sensory states (squares) and two possible actions (diamonds). The agent has two main components: the world model and the policy. The world model is composed of vertices representing observations (squares) while clone clips represent all values a belief state can take (circles). As in a clone-structured graph, each clone clip $b$ relates to exactly one observation $s$ and the emission function $p(s|b)$ is deterministic. The clone clips, together with the set of edges between them, form an ECM. A belief state, circled in purple, is designated by an excited clone clip. The weighted edges in the ECM encode the transition function and are trainable with reinforcement: there is one set of edges per action (light and dark turquoise arrows). The belief state in the ECM is an input to the policy, where the probability of sampling an action is a function of the EFE. In turn, the action that was selected determines the edge set to sample from in the world model in order to make a prediction for the next belief state and observation. b) Training of the world model of an FEPS agent. The agent interacts with the environment by receiving observations and implementing actions. When an action $a_{t}$ is chosen, a corresponding edge $b_{t} \xrightarrow{a_{t}} b_{t+1}$ is sampled in the world model, from the current to the next belief state, conditioned on the action. The observation $s_{t+1}$ associated with the next belief state is the prediction for the next sensory state. Simultaneously, the action is applied to the environment and creates a transition in the hidden states of the environment, $e_{t} \xrightarrow{a_{t}} e_{t+1}$ (bottom, green rectangle). This transition is perceived by the agent through the observation $s^\mathrm{env}_{t+1}$. Finally, the weights of the edges are updated. The reinforcement of an edge is proportional to the number of correct predictions it enabled in a row, as depicted with the thickness of the arrows in the world model. When the agent makes an incorrect prediction (the purple arrow), the reinforcements are applied to the edges that contributed to the trajectory. The last, incorrect, edge is not reinforced.
  • Figure 2: Estimation of belief states in superposition, after the world model has been trained. To minimize its prediction error due to faulty belief state estimation, an agent considers multiple clone clips as candidate belief states simultaneously. For the initial observation, $s_1^\mathrm{env}$ (on the left), the agent includes all corresponding clone clips $\{b_t^i\}_{i=1}^3$ in its hypothesis, as depicted on the right. Conditioned on the chosen action, a clone clip is sampled for each candidate belief state to represent the next one. Finally, all clone clips that are incompatible with the observation from the environment are eliminated from the hypothesis. The clips that remain become the current candidate belief states. In the world model, the thickness of the arrows represents the look-ahead preferences: the thicker the arrow, the more advantageous the transition is for reaching the target observation, $s_4$ in this case.
  • Figure 3: MDP for the timed response environment. This environment has four hidden states. The observations are compositional and contain information that is both external (light on or off) and internal (hungry or satiated) to the agent. Arrows correspond to the transitions that the agent's actions can result in. In this environment, the agent can either wait or press a lever. In order to complete the task, the agent must reach $E_0'$ and feel satiated. The only way to do so is to follow the actions marked with thicker arrows. The observation (light on, hungry) is called ambiguous because it can be emitted by two hidden states, $E_1$ and $E_2$, that can only be distinguished with context.
  • Figure 4: Training FEPS agents for the timed response task. a) Evolution of the variational free energy (top) defined in Eq. \ref{eq: VFE} and expected free energy as in Eq. \ref{eq: EFE} (bottom) during the training, averaged over 100 agents and a time window of 100 episodes. At each step, the VFE depends on the specific belief states and actions that were sampled. Two types of training are compared: a first set, "task" in dark purple, learned the model with a preference to fulfill a task in the environment, while the second set, "wander", in green, experimented aimlessly in the environment with a uniform policy before switching to the task. Both were trained for 4000 steps before being tested on the task. The best and worst agents are represented with dashed and dotted lines respectively, and examples of individual agents were traced with transparent lines, full for task-oriented agents, and dashed for the wandering ones. When the VFE converges to its minimal value, the world model is precise enough for most belief states to make planning possible. As expected from the values chosen for the scaling parameter $\zeta$, agents select actions that minimize and maximize the EFE for task-oriented and wandering agents respectively. The EFE of wandering agents plateaus quickly at the limit derived in Appendix \ref{appendix: limit EFE wandering}. b) World model learned by one of the agents trained on the task, where each circle is a belief state, whose observation is denoted by its colors and label. The numbers at the center of the circles are the clone indices for each clone clip. Arrows indicate the transitions learned in the world model, red for action "press the lever", and blue for "waiting". Dashed lines indicate that both actions lead to the same transition. The weight on the arrow indicates its probability in the world model. Stars mark transitions that were identified as useful to achieve the goal with a probability of 1 in the preference distribution. The policy is indicated by the thickness of the arrows, where a thick arrow corresponds to probabilities close to 1, and a thinner one to probabilities close to 0.5.
  • Figure 5: Training results for the grid world environment. a) Evolution of the length of the trajectories during the training, for different scaling parameters ranging from -3 to 3, and different preference distributions: the agent can either learn to complete the task from the start ("task"), or first wander in the grid ("wander"). We represent the running averages over a time window of 1500 steps of the lengths of trajectories, averaged over 30 agents. These lengths depend on the specific belief states and actions sampled by the agents. b) Evolution of the variational and expected free energies during the training for the two best settings in a): the task-oriented preferences are paired with a scaling parameter of -3, and the wandering preferences with a parameter of +1. The thick lines represent the running average of energies over a time window of 1500 steps, averaged over all 30 agents, while the dotted and dashed lines stand for the best and worst agents, that is, the agents whose VFE converges first and last, respectively, to a minimum. The transparent lines indicate the behavior of agents selected randomly: these lines are full for task-oriented agents and dashed for wandering agents. c) Comparison of the model's prediction accuracy between the two best parameter settings in a). The policies of the agents are uniform, such that the actions yielding most certain outcomes cannot be relied upon to validate predictions. Two belief state estimation techniques are tested. Bare estimation samples a single clone clip at a time, whereas the evaluation of belief states in superposition allows multiple clone clips to represent candidate belief states simultaneously, as long as they produce predictions that are compatible with the next observation. Each individual agent is tested over 1000 trajectories of at most 80 steps.
  • ...and 1 more figures
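
The captions of Figures 1b and 2 describe two procedures that a small sketch may make more concrete: reinforcing edges after a run of correct predictions, and estimating belief states in superposition. The Python below is an illustrative toy, not the authors' implementation. The clone clips (`b0` to `b3`), the single action `go`, the learning `rate`, and all function names are hypothetical. Two behaviors are assumptions, because the captions do not specify them: an edge at position k in a streak of n correct predictions is credited with n - k, and the hypothesis is reset to all clones of the observation when every candidate has been eliminated.

```python
import random

# Hypothetical toy world model. Each clone clip emits exactly one
# observation (deterministic emission, as in a clone-structured graph).
emission = {"b0": "s1", "b1": "s1", "b2": "s2", "b3": "s3"}

# Edge weights per (clone clip, action); one edge set per action.
weights = {
    ("b0", "go"): {"b2": 1.0},
    ("b1", "go"): {"b3": 1.0},
    ("b2", "go"): {"b0": 1.0},
    ("b3", "go"): {"b1": 1.0},
}


def clones_of(observation, emission):
    """Return every clone clip that emits the given observation."""
    return {b for b, s in emission.items() if s == observation}


def sample_next(weights, clip, action, rng):
    """Sample the next clone clip in proportion to the edge weights."""
    targets, w = zip(*weights[(clip, action)].items())
    return rng.choices(targets, weights=w, k=1)[0]


def update_candidates(candidates, action, observed, weights, emission, rng):
    """One step of belief-state estimation in superposition (Figure 2)."""
    # Sample one successor for each candidate belief state.
    predicted = {sample_next(weights, b, action, rng) for b in candidates}
    # Eliminate successors that are incompatible with the observation.
    kept = {b for b in predicted if emission[b] == observed}
    # Assumption: restart from all clones if nothing survives.
    return kept or clones_of(observed, emission)


def reinforce_streak(weights, streak, rate=1.0):
    """Reinforce the edges of a streak of correct predictions (Figure 1b).

    `streak` lists the (clip, action, next_clip) edges that predicted
    correctly, in order; the final incorrect edge is left out by the
    caller. Assumption: edge k of n is credited with n - k.
    """
    n = len(streak)
    for k, (clip, action, next_clip) in enumerate(streak):
        weights[(clip, action)][next_clip] += rate * (n - k)


rng = random.Random(0)
candidates = clones_of("s1", emission)  # ambiguous: two clones emit s1
candidates = update_candidates(candidates, "go", "s3", weights, emission, rng)
```

In this toy model, the observation `s1` is ambiguous because two clone clips emit it. After one action and the follow-up observation `s3`, only the compatible clone clip remains in the hypothesis.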