Dynamic Reinforcement Learning for Actors

Katsunari Shibata

TL;DR

This work introduces Dynamic Reinforcement Learning (Dynamic RL), a framework that embeds exploration within chaotic actor dynamics and modulates this behavior through local neuron sensitivity, rather than injecting external noise. By applying sensitivity adjustment learning (SAL) and sensitivity-controlled RL (SRL), the approach aims to balance exploration and reproducibility around state transitions, potentially enabling a progression from exploration to thinking. The experiments on memory-demanding sequencing and dynamic pattern generation show that Dynamic RL can match conventional RL performance while offering faster adaptation to environmental changes and dramatically lower actor-learning cost by eliminating backpropagation through time for the actor. If scalable and stable, this method could broaden autonomous exploration and may contribute toward thinking-like capabilities, though it also raises substantial safety and governance concerns that warrant broad discussion before wider deployment.

Abstract

Dynamic Reinforcement Learning (Dynamic RL), proposed in this paper, directly controls system dynamics instead of the actor (action-generating neural network) outputs at each moment, bringing about a major qualitative shift in reinforcement learning (RL) from static to dynamic. The actor is initially designed to generate chaotic dynamics through the loop with its environment, enabling the agent to perform flexible and deterministic exploration. Dynamic RL controls global system dynamics using a local index called "sensitivity," which indicates how much the input neighborhood contracts or expands into the corresponding output neighborhood through each neuron's processing. While sensitivity adjustment learning (SAL) prevents excessive convergence of the dynamics, sensitivity-controlled reinforcement learning (SRL) adjusts them: the dynamics are made to converge more, improving reproducibility, around better state transitions with positive TD error, and to diverge more, enhancing exploration, around worse transitions with negative TD error. Dynamic RL was applied only to the actor in an Actor-Critic RL architecture; applying it to the critic remains a challenge. It was tested on two dynamic tasks and functioned effectively without external exploration noise or backward computation through time. Moreover, it exhibited excellent adaptability to new environments, although some problems remain. Drawing parallels between 'exploration' and 'thinking,' the author hypothesizes that "exploration grows into thinking through learning" and believes this RL could be a key technique for the emergence of thinking, including inspiration that cannot be reconstructed from massive existing text data. Finally, though it may be presumptuous, the author argues that this research should not proceed because of its potentially fatal risks, with the aim of encouraging discussion.
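
The abstract describes sensitivity and SRL only qualitatively. The following is a minimal sketch of the idea, not the author's implementation. It assumes that the sensitivity of a single tanh neuron is the Euclidean norm of the gradient of its output with respect to its inputs, which is an assumption made here for illustration and not a formula quoted from the text above. The function names `sensitivity` and `srl_like_step`, the learning rate, and the finite-difference gradient are all hypothetical choices.

```python
import math

def sensitivity(weights, inputs, bias=0.0):
    """Assumed definition: norm of d(output)/d(inputs) for one tanh neuron."""
    u = sum(w * x for w, x in zip(weights, inputs)) + bias
    slope = 1.0 - math.tanh(u) ** 2               # derivative of tanh at u
    w_norm = math.sqrt(sum(w * w for w in weights))
    return slope * w_norm                          # < 1 contracts, > 1 expands

def srl_like_step(weights, inputs, td_error, lr=0.1, eps=1e-6):
    """Hypothetical SRL-style update using a finite-difference gradient.

    Positive TD error lowers sensitivity (more convergent, reproducible);
    negative TD error raises it (more divergent, exploratory).
    """
    base = sensitivity(weights, inputs)
    grad = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        grad.append((sensitivity(bumped, inputs) - base) / eps)
    return [w - lr * td_error * g for w, g in zip(weights, grad)]

w = [0.8, -0.5]
x = [0.3, 0.1]
s_before = sensitivity(w, x)
s_good = sensitivity(srl_like_step(w, x, td_error=+1.0), x)  # better transition
s_bad = sensitivity(srl_like_step(w, x, td_error=-1.0), x)   # worse transition
print(s_good < s_before < s_bad)
```

The update is purely local to the neuron and uses only the current input and the TD error, which is in keeping with the abstract's statement that the actor needs no backward computation through time. An SAL-style rule, which the abstract says prevents excessive convergence, is not shown here.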

Paper Structure

This paper contains 18 sections, 24 equations, 19 figures, and 4 tables.

Figures (19)

  • Figure 1: The difference in exploration between conventional RL and humans, who have inspired the exploration in the proposed Dynamic RL. In humans or Dynamic RL, the actor RNN embeds exploration factors into motor commands by inducing chaotic system dynamics without stochastic selection using a random number generator. (RNN: recurrent neural network)
  • Figure 2: The author's concept of the relationship between 'exploration' and 'thinking' and how they relate to system dynamics. 'Thinking' and 'exploration' are similar and inseparable in that both require multistep autonomous state transitions, and they exist continuously on a spectrum characterized by chaotic dynamics. In 'exploration,' the state transitions must be irregular. In 'thinking,' the state transitions must not only be less irregular but also rational. Dynamic RL controls the system dynamics based on a value function using reinforcement signals while preserving chaotic dynamics.
  • Figure 3: A conceptual diagram of the degrees of freedom (DOFs) of the dynamics, using a three-dimensional state space as an example. (A) Convergence (a) or divergence (b) may vary depending on the state, even within the same system. (B) Convergence or divergence can vary depending on the direction, even at the same state. The neighborhood around a state is originally three-dimensional, but only two dimensions are shown for ease of viewing. Additionally, the two eigenvectors are assumed to be orthogonal to each other and to the direction of state transition.
  • Figure 4: Dynamic RL applies either SAL or SRL depending on the condition in each neuron.
  • Figure 5: Basic concept of Dynamic RL (or more specifically, SRL) proposed in this paper.
  • ...and 14 more figures