Complex behavior from intrinsic motivation to occupy action-state path space

Jorge Ramírez-Ruiz, Dmytro Grytskyy, Chiara Mastrogiuseppe, Yamen Habib, Rubén Moreno-Bote

TL;DR

The paper reframes intelligent behavior as maximizing future occupancy of action-state paths rather than maximizing extrinsic rewards, formalizing this as the maximum occupancy principle (MOP) with an intrinsic return based on the sum of action and successor-state entropies. It shows that the occupancy measure is uniquely given by path entropy, derives a Bellman-like equation for the optimal policy and value, and provides a convergent iterative z-map to compute the optimal value. Across discrete and continuous tasks—including four-room navigation, predator-prey interactions, a dancing cartpole, altruistic fence scenarios, and a high-dimensional quadruped—the MOP agents exhibit rich, variable, and seemingly goal-directed behaviors without reward maximization, while comparisons to empowerment and free-energy approaches highlight higher behavioral diversity in MOP. These results suggest intrinsic path-occupancy motivation as a general, scalable framework for exploring variability and goal-directedness in artificial agents, with potential implications for unsupervised skill discovery and robust exploration in complex environments.

Abstract

Most theories of behavior posit that agents tend to maximize some form of reward or utility. However, animals very often move with curiosity and seem to be motivated in a reward-free manner. Here we abandon the idea of reward maximization, and propose that the goal of behavior is maximizing occupancy of future paths of actions and states. According to this maximum occupancy principle, rewards are the means to occupy path space, not the goal per se; goal-directedness simply emerges as rational ways of searching for resources so that movement, understood amply, never ends. We find that action-state path entropy is the only measure consistent with additivity and other intuitive properties of expected future action-state path occupancy. We provide analytical expressions that relate the optimal policy and state-value function, and prove convergence of our value iteration algorithm. Using discrete and continuous state tasks, including a high-dimensional controller, we show that complex behaviors such as 'dancing', hide-and-seek and a basic form of altruistic behavior naturally result from the intrinsic motivation to occupy path space. All in all, we present a theory of behavior that generates both variability and goal-directedness in the absence of reward maximization.
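
To make the link between optimal policy and state value concrete, here is a minimal sketch for a small tabular problem. It assumes the intrinsic return at each step is $-\alpha \ln \pi(a|s) - \beta \ln p(s'|s,a)$, discounted by $\gamma$ (the symbols $\alpha$, $\beta$, $\gamma$ appear in the figure captions; the exact form is an assumption here). Under that assumption the optimal policy is a softmax over actions and the value is $\alpha$ times a log-sum-exp. The function name `mop_value_iteration`, the `avail` mask, and the two-state toy problem are hypothetical, and the iteration is a plain log-sum-exp fixed-point scheme, not the authors' z-map implementation.

```python
import numpy as np

def mop_value_iteration(P, avail=None, alpha=1.0, beta=0.0, gamma=0.9,
                        tol=1e-10, max_iter=100_000):
    """Entropy-seeking value iteration on a tabular MDP (illustrative sketch).

    P     : array (S, A, S) with P[s, a, s'] = p(s' | s, a)
    avail : boolean array (S, A); False marks actions unavailable in a state
    Returns (V, pi): state values (S,) and softmax policy (S, A).
    """
    P = np.asarray(P, dtype=float)
    S, A, _ = P.shape
    if avail is None:
        avail = np.ones((S, A), dtype=bool)

    # Successor-state entropy H(S'|s,a) in nats, using the convention 0 ln 0 = 0.
    safe_P = np.where(P > 0, P, 1.0)
    H = -(P * np.log(safe_P)).sum(axis=2)

    def scaled_q(V):
        # (beta * H + gamma * E[V(s')]) / alpha, with unavailable actions masked out.
        q = (beta * H + gamma * (P @ V)) / alpha
        return np.where(avail, q, -np.inf)

    V = np.zeros(S)
    for _ in range(max_iter):
        x = scaled_q(V)
        m = x.max(axis=1, keepdims=True)  # shift for a numerically stable log-sum-exp
        V_new = alpha * (m[:, 0] + np.log(np.exp(x - m).sum(axis=1)))
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break

    x = scaled_q(V)
    pi = np.exp(x - x.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    return V, pi

# Hypothetical toy problem: state 0 is absorbing (its only action is "stay");
# in state 1, action 0 jumps into the absorbing state and action 1 stays put.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 0] = 1.0  # placeholder row, masked out by `avail`
P[1, 0, 0] = 1.0
P[1, 1, 1] = 1.0
avail = np.array([[True, False], [True, True]])

V, pi = mop_value_iteration(P, avail=avail, alpha=1.0, beta=0.0, gamma=0.9)
print("values:", V)
print("policy in state 1:", pi[1])
```

In this toy problem the absorbing state has value zero because no further action-state paths branch from it, so the policy in state 1 puts most of its probability on the action that avoids absorption while still sampling the other action occasionally.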

Paper Structure

This paper contains 81 sections, 6 theorems, 74 equations, 16 figures, and 1 table.

Key Result

Theorem 1

$C(p)=-k \ln p$ with $k>0$ is the only function that satisfies Conditions 1-4.
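
Conditions 1-4 are not reproduced on this page, but the abstract names additivity as one of the required properties. As a quick check under that reading, the logarithmic form turns products of probabilities into sums of occupancies, and its expectation is the Shannon entropy. The path factorization in the last line assumes a Markov policy $\pi$ and transition kernel $p$, and omits the weights and discounting used elsewhere in the paper.

```latex
% Additivity: probabilities of independent steps multiply, occupancies add.
C(p_1 p_2) = -k \ln (p_1 p_2) = -k \ln p_1 - k \ln p_2 = C(p_1) + C(p_2)

% Expected occupancy of a distribution p is k times its Shannon entropy H(p).
\mathbb{E}_{x \sim p}\left[ C(p(x)) \right] = -k \sum_x p(x) \ln p(x) = k\, H(p)

% For a path \tau = (s_0, a_0, s_1, a_1, \dots, s_T) with
% p(\tau) = \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t):
C(p(\tau)) = -k \sum_{t=0}^{T-1} \left[ \ln \pi(a_t \mid s_t) + \ln p(s_{t+1} \mid s_t, a_t) \right]
```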

Figures (16)

  • Figure 1: MOP agents maximize action-state path occupancy. (a) A MOP agent (grey triangle) in the middle of two rooms has the choice between going left or right. When the number of actions (black arrows) in each room is the same, the agent prefers going to the room with more state transitions (blue arrows indicate random transitions after choosing the move-right or move-left action, and pink arrow width indicates the probabilities of those actions). (b) When the state transitions are the same in the two rooms, the MOP agent prefers the room with more available actions. (c) If there are many absorbing states in the room where many actions are available, the MOP agent avoids it. (d) Even if there are action and state-transition incentives (in the left room), a MOP agent might prefer a region of state space where it can reliably get food (right room), ensuring occupancy of future action-state paths. See Supplemental Fig. \ref{fig:schematic_formal} for a more formal example.
  • Figure 2: Maximizing future path occupancy leads to high occupancy of physical space. (a) Grid-world arena. The agents have nine available actions (arrows, and staying still) when alive (internal energy $E$ larger than zero) and away from walls. There are four rooms, each with a small food source in a corner (green diamonds). (b) Probability of visited spatial states for a MOP agent, an $\epsilon$-greedy reward (R) agent that survives as long as the MOP agent, and a random walker. Food gain $=10$ units, maximum reservoir energy $=100$, episodes of $5\times 10^4$ time steps, and $(\alpha,\beta)=(1,0)$ for the MOP agent. All agents are initialized in the middle of the lower left room. (c) Optimal value function $V^*(s)$ over locations when energy is $E = 5$. Black arrows represent the optimal policy given by Eq. \ref{eq_pi_opt_m}; their length is proportional to the probability of each action. The size of the red dots is proportional to the probability of the do-nothing action. (d) Fraction of locations of the arena visited at least once per episode as a function of food gain. Error bars correspond to s.e.m. over $50$ episodes. (e) Noisy room problem. The bottom right room of the arena is noisy, such that agents in this room jump randomly to neighboring locations regardless of their actions. Food gain equals maximum reservoir energy $=100$. Histogram of visited locations for an episode as long as in (b) for a MOP agent with $\beta=0.3$ (left) and time fraction spent in the noisy room (right) show that MOP agents with $\beta> 0$ can either be attracted to the room or repelled depending on $\gamma$.
  • Figure 3: Complex hide-and-seek and escaping strategies in a prey-predator example. (a) Grid-world arena. The agent has nine available actions when alive and far from walls. There is a small food source in a corner (green diamond). A predator (red, down triangle) is attracted to the agent (gray, up triangle), such that when they are at the same location, the agent dies. The predator cannot enter the locations surrounded by the brown border. Arrows show a clockwise trajectory. (b) Histogram of visited spatial states across episodes for the MOP and R agents. The vector field at each location indicates the probability of transition at that location. Green arrows on the R agent panel show major motion directions associated with its dominant clockwise rotation. (c) Fraction of clockwise rotations (as in panel (a)) to total rotations as a function of food gain, averaged over epochs of 500 timesteps. Error bars are s.e.m. (d) Optimal value functions for different energy levels and the same predator position; black arrows indicate the optimal policy, as in Fig. 2c.
  • Figure 4: Dancing of a MOP cartpole. (a) The cart (brown rectangle) has a pole attached. The cartpole reaches an absorbing state if the magnitude of the angle $\theta$ exceeds $36^\circ$ or its position reaches the borders. There are 5 available actions when alive: a big and a small force to either side (arrows on cartpole), and doing nothing (full circle). (b) Time-shifted snapshots of the pole in the reference frame of the cart as a function of time for the MOP (top) and R (bottom) agents. (c) Position and angle occupation for a $2 \times 10^5$ time step episode. (d) Here, the right half of the arena is stochastic, while the left remains deterministic. In the stochastic half, the intended state transition due to an applied action (force) succeeds with probability $1-\eta$ (and thus zero force is applied with probability $\eta$). (e) Fraction of time spent on the right half of the arena increases as a function of $\beta$, regardless of the failure probability $\eta$. (f) The fraction has a non-monotonic behavior as a function of $\eta$ when state entropy is important for the agent ($\beta=1$), highlighting a stochastic resonance behavior. When the agents do not seek state entropy ($\beta=0$) the fraction of time spent by the agent on the right decreases with the failure probability, and thus they avoid the stochastic right side. $\gamma = 0.99$ for panels (e,f).
  • Figure 5: Modelling altruism through an optimal tradeoff between own action entropy and other's state entropy. (a) An agent (gray up triangle) has access to nine movement actions (gray arrows and doing nothing), and can also open or close a fence (dashed blue lines). This fence does not affect its movements. A pet (green, down triangle) has access to the same actions, and chooses one randomly at each timestep, but is constrained by the fence when closed. Pet location is part of the state of the agent. (b) As $\beta$ in Eq. (\ref{eq_expected_return}) is increased, the agent tends to leave the fence open for a larger fraction of time. This helps its pet reach other parts of the arena. Error bars correspond to s.e.m. (c) Occupation heatmaps for 2000-timestep episodes for $\beta = 0$ (left) and $\beta=1$ (right). In all cases $\alpha=1$.
  • ...and 11 more figures
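
The stochastic half of the cartpole arena in Figure 4(d) gives a simple worked case for the state-entropy term. The sketch below assumes a nonzero-force action has exactly two distinct outcomes: the intended transition with probability $1-\eta$ and the zero-force transition with probability $\eta$. The function name `successor_entropy` and the sampled values of $\eta$ are hypothetical. It computes only the one-step successor-state entropy, not the time fractions plotted in panels (e) and (f).

```python
import math

def successor_entropy(eta: float) -> float:
    """One-step successor-state entropy (nats) for a two-outcome transition.

    Assumes the intended transition occurs with probability 1 - eta and the
    zero-force transition with probability eta, leading to distinct states.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    # Skip zero-probability outcomes, following the convention 0 ln 0 = 0.
    return -sum(p * math.log(p) for p in (eta, 1.0 - eta) if p > 0.0)

# Hypothetical sweep over failure probabilities.
etas = [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]
entropies = {eta: successor_entropy(eta) for eta in etas}
for eta, h in entropies.items():
    print(f"eta = {eta:4.2f}  ->  H(S'|s,a) = {h:.4f} nats")
```

The one-step entropy is zero at $\eta=0$ and $\eta=1$ and largest at $\eta=0.5$, so an agent with $\beta>0$ gains the most per-step state entropy at intermediate failure probabilities, while larger $\eta$ also makes the pole harder to control.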

Theorems & Definitions (10)

  • Theorem 1
  • Corollary 1
  • proof
  • Corollary 2
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof