Deep RL Needs Deep Behavior Analysis: Exploring Implicit Planning by Model-Free Agents in Open-Ended Environments

Riley Simmons-Edler, Ryan P. Badman, Felix Baastad Berg, Raymond Chua, John J. Vastola, Joshua Lunger, William Qian, Kanaka Rajan

TL;DR

This work addresses the need for behavior-focused diagnostics in deep RL by introducing ForageWorld, a naturalistic, partially observable foraging environment. It develops a neuroscience-inspired analysis framework and applies it to model-free PPO-RNN agents, showing that memory and planning-like behaviors can emerge without explicit world models. Through decoding of allocentric position, GLM analyses, and ablations, the study reveals structured exploration, patch revisitation, and modular internal representations that support long-horizon planning. It demonstrates staged skill acquisition during training and shows how architectural choices such as recurrence, pruning, and auxiliary losses influence both performance and interpretability. By releasing open-source pipelines and linking behavioral and neural analyses, the paper contributes to neuroAI and offers a robust platform for evaluating safe, desirable behaviors in open-ended autonomous agents.

Abstract

Understanding the behavior of deep reinforcement learning (DRL) agents, particularly as task and agent sophistication increase, requires more than simple comparison of reward curves, yet standard methods for behavioral analysis remain underdeveloped in DRL. We apply tools from neuroscience and ethology to study DRL agents in a novel, complex, partially observable environment, ForageWorld, designed to capture key aspects of real-world animal foraging, including sparse, depleting resource patches, predator threats, and spatially extended arenas. We use this environment as a platform for applying joint behavioral and neural analysis to agents, revealing detailed, quantitatively grounded insights into agent strategies, memory, and planning. Contrary to common assumptions, we find that model-free RNN-based DRL agents can exhibit structured, planning-like behavior purely through emergent dynamics, without requiring explicit memory modules or world models. Our results show that studying DRL agents like animals, analyzing them with neuroethology-inspired tools that reveal structure in both behavior and neural dynamics, uncovers rich structure in their learning dynamics that would otherwise remain invisible. We distill these tools into a general analysis framework linking core behavioral and representational features to diagnostic methods, which can be reused for a wide range of tasks and agents. As agents grow more complex and autonomous, bridging neuroscience, cognitive science, and AI will be essential, not just for understanding their behavior, but for ensuring safe alignment and maximizing desirable behaviors that are hard to measure via reward. We show how this can be done by drawing on lessons from how biological intelligence is studied.

Paper Structure

This paper contains 41 sections, 5 equations, 31 figures, 3 tables.

Figures (31)

  • Figure 1: Model-free agents exhibit structured exploration and revisitation behavior in an open-ended, partially observable environment. (A) A 9$\times$11 grid view (left) shows the agent’s local observation window, positioned within the full 96$\times$96 environment (right). Cows (food) diffuse from spawn points and deplete when consumed; lakes (drink) are fixed and unlimited; predators intermittently pursue agents in view. (B) Agent observations include the visual window, inventory, and internal states (e.g., health, hunger, fatigue). This agent is sleeping, typically in corners to reduce predator contact. (C) Trajectories from a single episode: three early sequential exploratory paths (left), initial revisitation (middle), and full path with revisited patches in orange (right). The paths depicted belong to the same episode and are directly consecutive, from Path 1 through Path 4, in the behavioral time series logs for the episode. Path 1 starts at t=0 of the episode.
  • Figure 2: Performance metrics and model ablations motivate deeper behavioral-neural analysis. (Top left) ForageWorld requires memory and substantial network capacity to learn. Replacing the 512-unit RNN with a feedforward network significantly impairs performance, as does downsizing to a 64-unit network (550k parameters). A midsize 128-unit network also underperforms when pruned, suggesting that high capacity is needed to support sparsity. (Top right) Pruning does not degrade training performance but improves the spatial interpretability of internal representations (see \ref{fig:Fig4decPrune}). (Bottom left) Removing the auxiliary path integration loss reduces performance in large arenas, but not small ones. Thus, the ability to predict current self-position appears critical for larger-scale navigation (see \ref{fig:Fig4decNoPathInt} for representational effects). (Bottom right) Limiting perception to a forward-facing field of view improves early learning, perhaps due to reduced input complexity, but final performance conceals behavioral differences (see \ref{fig:Fig3LHfrontFOV}). (All plots) Curves show the time-weighted EMA mean across 5 random seeds; shaded regions show one standard deviation.
  • Figure 3: Memory-guided, multi-objective revisitation strategies emerge in model-free agents without world models. Generalized linear model (GLM) coefficients for patch history variables predicting an agent’s choice to revisit one patch over others. Choices are defined 50 timesteps before each patch eat event. From left to right: agents prefer patches with fewer prior eat actions (EatRate); show no preference for water proximity (DrinkRate); avoid patches with more predator encounters (PredRate); prefer patches visited more recently (Recency); show mild preference for longer dwell time (Dwelltime); prefer patches with more observed cows (CowCount); and prefer patches with higher prior position prediction error (Uncertainty). Significance levels: * $p < 0.05$, ** $p < 0.01$, *** $p < 0.001$. See \ref{fig:GLMstats} for model output and VIF analysis. Error bars are 95% confidence intervals (CI).
  • Figure 4: Distinct behavioral competencies emerge over training, with early gains in exploration followed by refinement of survival strategies. Training dynamics show staged acquisition of task-aligned skills. Early learning emphasizes spatial exploration and arena coverage, while later training refines predator response, tool use, and pathing patterns. Metrics include: spatial uncertainty (normalized by distance from origin); distance from origin (early vs. late, meaning before vs. after the first 1500 timesteps in an episode); state occupancy entropy of agent position; angular orientation variance across 250-timestep intervals; predator field-of-view exposure; tool-making rate (1 = one tool crafted, 2 = both); and food/water satiation levels. Error bars are 95% CI. These plots can be compared to the generally linear performance gains seen in survival times across training in \ref{fig:performance}.
  • Figure 5: RNN states encode allocentric position and temporal structure, revealing emergent memory and planning capacity in model-free agents. (A) A single decoding model was trained per agent to test whether spatial information was encoded in allocentric (relative to origin) or egocentric (relative to agent) coordinates. (B) Late-training decoding performance for egocentric distance remained at chance level across models. (C) Allocentric position could be decoded above chance up to approximately 50–100 timesteps into the past and future depending on the run. Training arena count per decoder was varied to show that models did not rely on arena-specific cues. (D) Allocentric decoding improved over training. Error bars are larger in (D) because decoding was limited to timesteps 1000–6000 per arena, due to shorter survival in early-training (10k epoch) agents. All plots use average displacement per timestep as a chance baseline. Error bars reflect 95% CI.
  • ...and 26 more figures
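
One of the Figure 4 metrics, state occupancy entropy of agent position, can be illustrated with a minimal sketch. The captions do not specify how the authors compute it, so the following assumes the simplest reading: the Shannon entropy (in bits) of the empirical distribution of visited grid cells within an episode. The function name `occupancy_entropy`, the `(row, col)` position format, and the optional normalization by the number of arena cells are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def occupancy_entropy(positions, n_cells=None):
    """Shannon entropy (bits) of the empirical visitation distribution.

    positions: iterable of hashable grid cells, e.g. (row, col) tuples,
               one entry per timestep (assumed format, not from the paper).
    n_cells:   optional total number of cells in the arena; if given, the
               result is divided by log2(n_cells) so it lies in [0, 1].
    """
    counts = Counter(positions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("positions must contain at least one timestep")

    # H = sum_i p_i * log2(1 / p_i), where p_i is the fraction of
    # timesteps spent in cell i.
    entropy = sum(
        (c / total) * math.log2(total / c) for c in counts.values()
    )

    if n_cells is not None:
        if n_cells < 2:
            raise ValueError("n_cells must be at least 2 to normalize")
        entropy /= math.log2(n_cells)
    return entropy


if __name__ == "__main__":
    # An agent that never moves has zero occupancy entropy.
    stationary = [(3, 3)] * 10
    # An agent that spends equal time in four cells has 2 bits.
    roaming = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(occupancy_entropy(stationary))
    print(occupancy_entropy(roaming))
    # Normalized against a 96 x 96 arena, as described for Figure 1.
    print(occupancy_entropy(roaming, n_cells=96 * 96))
```

Under this reading, higher values indicate broader arena coverage, which is consistent with the caption's description of early training emphasizing spatial exploration. Other definitions (for example, entropy over coarser spatial bins) would be equally plausible.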