Physical Reinforcement Learning

Sam Dillavou, Shruti Mishra

TL;DR

The paper tackles the challenge of energy-efficient, fault-tolerant reinforcement learning by deploying Contrastive Local Learning Networks (CLLNs), analog networks of self-adjusting resistors. It adapts Q-learning to operate on a simulated CLLN, encoding environmental states as input voltages and interpreting outputs as action-values. Each element updates itself with the local contrastive rule $\delta G_i = \alpha [ (\Delta V_i^F)^2 - (\Delta V_i^C)^2 ]$, where $F$ and $C$ denote the free and clamped states, and the update is driven by the contrastive power difference $\mathcal{P}^C-\mathcal{P}^F$. On two tasks (a four-state, four-action MDP and a nine-state grid navigation task) the network reaches near-optimal performance in most trials. The work discusses which RL components map naturally to CLLNs (e.g., the policy and value functions) and which require additional hardware (e.g., a replay buffer or other memory), and outlines how physical constraints shape learning in analog substrates. It also highlights implications for energy efficiency, robustness to damage, and opportunities to tailor learning toward hardware-friendly objectives.
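
The local contrastive rule above can be illustrated with a minimal sketch. Everything in it beyond the update formula is an assumption made for illustration: the resistors are treated as linear (the paper's elements are nonlinear), the clamped state is formed by nudging the output a fraction `eta` of the way toward the target, and the names `solve_voltages`, `contrastive_step`, `eta`, and `g_min` are hypothetical rather than taken from the paper.

```python
import numpy as np

def solve_voltages(edges, g, n_nodes, fixed):
    """Node voltages of a linear resistor network with some nodes held fixed."""
    lap = np.zeros((n_nodes, n_nodes))  # weighted graph Laplacian
    for (i, j), g_ij in zip(edges, g):
        lap[i, i] += g_ij
        lap[j, j] += g_ij
        lap[i, j] -= g_ij
        lap[j, i] -= g_ij
    held = sorted(fixed)
    free = [k for k in range(n_nodes) if k not in fixed]
    v = np.zeros(n_nodes)
    v[held] = [fixed[k] for k in held]
    if free:  # Kirchhoff's current law at every unconstrained node
        a = lap[np.ix_(free, free)]
        b = -lap[np.ix_(free, held)] @ v[held]
        v[free] = np.linalg.solve(a, b)
    return v

def contrastive_step(edges, g, n_nodes, inputs, out_node, target,
                     alpha=0.5, eta=0.5, g_min=0.05):
    """One update: delta G_i = alpha * ((dV_i^F)^2 - (dV_i^C)^2)."""
    # Free state: only the inputs are imposed.
    v_free = solve_voltages(edges, g, n_nodes, inputs)
    # Clamped state (assumed form): output nudged toward the target.
    nudged = v_free[out_node] + eta * (target - v_free[out_node])
    v_clamped = solve_voltages(edges, g, n_nodes, {**inputs, out_node: nudged})
    # Each edge needs only its own voltage drop in the two states.
    src, dst = np.array(edges).T
    dv_free = v_free[src] - v_free[dst]
    dv_clamped = v_clamped[src] - v_clamped[dst]
    g_new = g + alpha * (dv_free**2 - dv_clamped**2)
    # Keep conductances positive (g_min is an illustrative floor).
    return np.maximum(g_new, g_min), v_free[out_node]

# Toy voltage divider: node 0 at 1 V, node 1 at 0 V, node 2 is the output.
edges = [(0, 2), (2, 1)]
g = np.ones(2)
outputs = []
for _ in range(200):
    g, out = contrastive_step(edges, g, 3, {0: 1.0, 1: 0.0}, 2, target=0.8)
    outputs.append(out)
```

In this toy divider the output starts at 0.5 V and drifts toward the 0.8 V target as the two conductances adjust, using only per-edge voltage drops and no global gradient computation.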

Abstract

Digital computers are power-hungry and largely intolerant of damaged components, making them potentially difficult tools for energy-limited autonomous agents in uncertain environments. Recently developed Contrastive Local Learning Networks (CLLNs), analog networks of self-adjusting nonlinear resistors, are inherently low-power and robust to physical damage, but were constructed to perform supervised learning. In this work we demonstrate success on two simple RL problems using Q-learning adapted for simulated CLLNs. Doing so makes explicit the components (beyond the network being trained) required to enact various tools in the RL toolbox, some of which (policy function and value function) are more natural in this system than others (replay buffer). We discuss assumptions, such as physical safety, that digital hardware requires, that CLLNs can forgo, and that biological systems cannot rely on. We also highlight secondary goals that are important in biology and trainable in CLLNs but make little sense in digital computers.
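
As a rough illustration of the pieces that sit outside the trained network when Q-learning is adapted this way, the sketch below shows a state encoding, a standard Q-learning target, and an epsilon-greedy action choice. The one-hot voltage encoding, the voltage levels, the discount `gamma`, and all function names are assumptions chosen for illustration; the paper's actual encoding and hyperparameters are not given in this summary.

```python
import numpy as np

def encode_state(state, n_states, v_on=1.0, v_off=0.0):
    """Assumed one-hot encoding: one input node per state, held at v_on."""
    voltages = np.full(n_states, v_off)
    voltages[state] = v_on
    return voltages

def td_target(reward, next_action_values, gamma=0.9):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * np.max(next_action_values)

def epsilon_greedy(action_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(action_values)))
    return int(np.argmax(action_values))

rng = np.random.default_rng(0)
inputs = encode_state(2, n_states=4)            # voltages applied to the inputs
action = epsilon_greedy([0.1, 0.9, 0.3, 0.2], epsilon=0.0, rng=rng)
target = td_target(1.0, [0.0, 2.0], gamma=0.5)  # value the chosen output is trained toward
```

Under these assumptions, the target would be imposed on the output node of the chosen action during the clamped phase, while the network itself stores the action-values in its conductances.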

Paper Structure

This paper contains 9 sections, 6 equations, 4 figures.

Figures (4)

  • Figure 1: A schematic of reinforcement learning in three scenarios. Left: A digital agent learns in a digital environment, both simulated by a computer. Middle: A digital agent interacts with a physical environment, and the learning process is controlled by a computer. Right (proposed): A Contrastive Local Learning Network interacts with a physical environment, and learns based on those interactions. There is no digital processing; learning is done in a distributed, analog fashion. One of the many identical self-adjusting components is highlighted.
  • Figure 2: Schematic of a Contrastive Local Learning Network (CLLN). The configuration shown is used for the Markov decision process with four states and four actions. A modified network architecture and input-output position is used for the 9-state navigation task (see Fig. 4). A high-level description of the contrastive training protocol is outlined in the gray box. Note that each element in a CLLN requires only local measurements of the two states (free and clamped) to update itself, decentralizing the training process.
  • Figure 3: Q Learning with CLLN. (A) Reward schedule for each state (noise effect shown as small error bars). The optimal strategy involves a cycle through all four states. (B) Average reward over training for 10 trials (purples) overlaid with their average (black).
  • Figure 4: Navigation Task. (A) Reward schedule, with all states approximately equal except for the target (upper left) state. There is no reward noise. The optimal action in each grid is indicated via arrows. (B) Average reward over training for 10 trials (purples) is overlaid with their average (black). (C) The 10 strategies at the end of training are each simulated for 10,000 steps (reset randomly every 5 steps), and the fraction of total time spent in each state is shown as a heatmap.