Emergent time-keeping mechanisms in a deep reinforcement learning agent performing an interval timing task

Amrapali Pednekar, Alvaro Garrido, Pieter Simoens, Yara Khaluf

TL;DR

The paper addresses how temporal processing and interval timing can emerge in a DRL agent and whether such internal time-keeping resembles biological mechanisms. It trains a CNN-LSTM agent with PPO on a video-based duration production task and analyzes hidden-state dynamics with FFT and PCA, finding oscillations whose frequencies align with the target interval, e.g., a target of four steps yields a frequency of $0.25$ cycles per time-step. The key findings show that high-amplitude oscillations in the LSTM states drive timing decisions and persist across different input videos, indicating internalization of time-keeping; the behavior demonstrates robustness to environmental changes while still relying on temporal coherence during training. The work introduces a biologically inspired interpretation via parallels to the Striatal Beat Frequency model and discusses circadian rhythm–like dynamics as a broader biological analogy, illustrating how DRL systems can illuminate temporal processing in biological systems.

Abstract

Drawing parallels between Deep Artificial Neural Networks (DNNs) and biological systems can aid in understanding complex biological mechanisms that are difficult to disentangle. Temporal processing, an extensively researched topic, is one such example that lacks a coherent understanding of its underlying mechanisms. In this study, we investigate temporal processing in a Deep Reinforcement Learning (DRL) agent performing an interval timing task and explore potential biological counterparts to its emergent behavior. The agent was successfully trained to perform a duration production task, which involved marking successive occurrences of a target interval while viewing a video sequence. Analysis of the agent's internal states revealed oscillatory neural activations, a ubiquitous pattern in biological systems. Interestingly, the agent's actions were predominantly influenced by neurons exhibiting these oscillations with high amplitudes and frequencies corresponding to the target interval. Parallels are drawn between the agent's time-keeping strategy and the Striatal Beat Frequency (SBF) model, a biologically plausible model of interval timing. Furthermore, the agent maintained its oscillatory representations and task performance when tested on different video sequences (including a blank video). Thus, once learned, the agent internalized its time-keeping mechanism and showed minimal reliance on its environment to perform the timing task. A hypothesis about the resemblance between this emergent behavior and certain aspects of the evolution of biological processes such as circadian rhythms is discussed. This study aims to contribute to recent research efforts that use DNNs to understand biological systems, with a particular emphasis on temporal processing.
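
The link between the target interval and the oscillation frequency can be illustrated with a short sketch: a signal that peaks once every four time steps has a dominant FFT frequency of 1/4 = 0.25 cycles per time step. The example below is not the authors' analysis code. It assumes synthetic stand-ins for the LSTM hidden states (256 noisy sinusoids with random amplitudes and phases), and the helper name `dominant_frequency` is hypothetical.

```python
import numpy as np


def dominant_frequency(activations: np.ndarray) -> np.ndarray:
    """Return the dominant non-zero frequency (cycles per time step) per neuron.

    activations: array of shape (T, N), one column per neuron.
    """
    n_steps = activations.shape[0]
    # Remove each neuron's mean so the zero-frequency bin does not dominate.
    centered = activations - activations.mean(axis=0, keepdims=True)
    spectrum = np.abs(np.fft.rfft(centered, axis=0))  # shape (T // 2 + 1, N)
    freqs = np.fft.rfftfreq(n_steps, d=1.0)           # d=1.0: one sample per time step
    return freqs[spectrum.argmax(axis=0)]


# Synthetic stand-in for hidden states (an assumption, not data from the paper).
rng = np.random.default_rng(0)
n_steps, n_neurons, target_interval = 200, 256, 4
t = np.arange(n_steps)[:, None]
amplitudes = rng.uniform(0.1, 1.0, size=(1, n_neurons))
phases = rng.uniform(0.0, 2.0 * np.pi, size=(1, n_neurons))
hidden = amplitudes * np.cos(2.0 * np.pi * t / target_interval + phases)
hidden = hidden + 0.01 * rng.normal(size=(n_steps, n_neurons))

peak_freqs = dominant_frequency(hidden)
print("dominant frequencies found:", np.unique(np.round(peak_freqs, 3)))
print("expected frequency:", 1.0 / target_interval)
```

With real recordings, `hidden` would be replaced by the logged hidden-state activations; the sequence length is best chosen as a multiple of the target interval so that the expected frequency falls exactly on an FFT bin.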

Paper Structure

This paper contains 8 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Overview of the DRL agent architecture and input processing pipeline. The agent's architecture consists of an action network (left) and a value network (right), both processing the input frames in parallel. The agent receives one frame from the environment at each time step. The input shape transitions, from the original frame to the action and state value, are shown in parentheses. At each time step, the agent takes one of two actions, “GO” or “Interval,” to advance to the next frame and receive a reward from the environment.
  • Figure 2: Fast Fourier Transform (FFT) of the LSTM hidden state activations across time for the time production task with a target duration of four time steps. The blue lines represent neurons with high weight magnitudes (z-score $>$ 2) for the "Interval" action in the action network, while the red lines represent neurons with low weight magnitudes (z-score $<$ -2) for both actions. The gray lines correspond to all other neurons. All neurons exhibit oscillations with a frequency of 0.25 cycles per time step (one peak every four time steps).
  • Figure 3: Principal component analysis (PCA) of the neural activations of the 256 hidden state neurons in the LSTM network across time for the time production task with a target interval of four time steps. The first and second principal components (in dark and light blue, respectively), which exhibit an oscillatory pattern across time steps, explain 56% and 28% of the variance, respectively. The first principal component oscillates with a period matching the target interval (i.e., four time steps).
  • Figure 4: Activations of the 256 LSTM hidden state neurons across time for the time production task with a target duration of four time steps. Neurons shown in blue had high weight magnitudes (z-score $>$ 2.0) for the "Interval" action in the agent's action network, while those in red had the lowest magnitudes (z-score $<$ -2.0) for both actions. Thus, neurons with high-amplitude oscillatory patterns contributed the most to action selection. These neurons peak at the reward-generating time steps (i.e., every fourth time step).
  • Figure 5: Delayed timing task. Principal component analysis (PCA) of the 256 LSTM hidden state activations across time for the delayed timing task with a target interval of four time steps. The first and second principal components explain 37% and 24% of the variance, respectively. A noticeable shift in activation patterns occurs after the cue frame onset, which marks the beginning of the time production phase. During this phase, the first principal component exhibits an oscillatory pattern with a period matching the target interval.
  • ...and 3 more figures
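
The two analyses named in the captions above, grouping neurons by the z-score of their read-out weights and projecting hidden states onto principal components, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the hidden states and the "Interval" weights are synthetic, the z-score is taken over signed weights (the captions do not specify the exact procedure), and the helper names `zscore` and `pca` are hypothetical.

```python
import numpy as np


def zscore(values: np.ndarray) -> np.ndarray:
    """Standardize a 1-D array to zero mean and unit standard deviation."""
    return (values - values.mean()) / values.std()


def pca(activations: np.ndarray, n_components: int = 2):
    """PCA via SVD on mean-centered data of shape (T, N).

    Returns the component scores over time, shape (T, n_components),
    and the fraction of variance explained by each kept component.
    """
    centered = activations - activations.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = u[:, :n_components] * s[:n_components]
    return scores, explained[:n_components]


# Synthetic stand-ins (assumptions, not data from the paper).
rng = np.random.default_rng(0)
n_steps, n_neurons, target_interval = 200, 256, 4
t = np.arange(n_steps)[:, None]
amplitudes = rng.uniform(0.1, 1.0, (1, n_neurons))
phases = rng.uniform(0.0, 2.0 * np.pi, (1, n_neurons))
hidden = amplitudes * np.cos(2.0 * np.pi * t / target_interval + phases)
hidden = hidden + 0.01 * rng.normal(size=(n_steps, n_neurons))

# Hypothetical read-out weights for the "Interval" action, with a few
# deliberately large positive and negative entries planted for illustration.
interval_weights = 0.05 * rng.normal(size=n_neurons)
interval_weights[:5] = 1.0
interval_weights[5:10] = -1.0

z = zscore(interval_weights)
high = z > 2.0   # analogous to the neurons drawn in blue
low = z < -2.0   # analogous to the neurons drawn in red

scores, explained = pca(hidden)
print("high-weight neurons:", int(high.sum()), "low-weight neurons:", int(low.sum()))
print("variance explained by PC1 and PC2:", np.round(explained, 3))
```

Because every synthetic neuron oscillates with the same period and only the phases differ, two components capture almost all of the variance here; real hidden states need not be this low-dimensional, as the percentages reported in the captions show.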