Deep Dive into Model-free Reinforcement Learning for Biological and Robotic Systems: Theory and Practice

Yusheng Jiao; Feng Ling; Sina Heydari; Nicolas Heess; Josh Merel; Eva Kanso

TL;DR

The paper addresses how to formulate and implement model-free reinforcement learning for embodied biological and robotic systems, emphasizing how morphology and physics shape the learning of sensorimotor policies. It presents an actor-critic framework, derives the policy gradient, and details practical techniques for approximating the critic, such as $K$-step bootstrapping and losses for fitting the value function $V_\phi$. It then discusses the design choices left to modelers (observations, action spaces, rewards, termination, and decision timing) before illustrating PPO as a robust, scalable training method that alternates environment simulation with network updates. The work provides a foundation for applying embodied RL to neuroscience and robotics, and points to future directions including recurrent architectures, transformers, and integration with large language models.

Abstract

Animals and robots exist in a physical world and must coordinate their bodies to achieve behavioral objectives. With recent developments in deep reinforcement learning, it is now possible for scientists and engineers to obtain sensorimotor strategies (policies) for specific tasks using physically simulated bodies and environments. However, the utility of these methods goes beyond the constraints of a specific task; they offer an exciting framework for understanding the organization of an animal sensorimotor system in connection to its morphology and physical interaction with the environment, as well as for deriving general design rules for sensing and actuation in robotic systems. Algorithms and code implementing both learning agents and environments are increasingly available, but the basic assumptions and choices that go into the formulation of an embodied feedback control problem using deep reinforcement learning may not be immediately apparent. Here, we present a concise exposition of the mathematical and algorithmic aspects of model-free reinforcement learning, specifically through the use of \textit{actor-critic} methods, as a tool for investigating the feedback control underlying animal and robotic behavior.

Paper Structure (25 sections, 31 equations, 3 figures, 3 algorithms)

Introduction
Mathematical underpinnings of model-free RL
Actor-critic methods
Approximating the critic
Training the RL agent
Agent Parameterization
Agent Updates
Modeler Choices
Episode Initialization
Agent-centric Observations
Degree of partial observability
Observation invariance
Biological plausibility
Simplicity for function approximation
Goal-instructed behaviors
...and 10 more sections

Figures (3)

Figure 1: Reinforcement Learning for Embodied Biological and Robotic Systems: A. Schematic representation of canonical interactions of a biological organism with its environment. The organism, typically composed of multiple interacting subsystems, potentially with a nervous system, muscles, and sensory organs, interacts with the external environment via its physical body: the body acts on the environment and receives sensory feedback, including proprioception. B. Reinforcement Learning (RL) provides a framework for studying the interaction of an embodied RL agent with its environment. Modelers decide the type of actions and observations afforded by the embodied agent, as well as the rewards it obtains from the environment. Physical interactions with the environment dictate the time evolution of the state of the system given a choice of action. The RL agent collects observations $o_t$ and rewards $r_t$ and takes actions $a_t$ according to the RL policy $\pi_\theta(a_t | o_t)$. Model-free RL is a type of learning where no explicit model of the physical transition rules $P(s_{t+1}|s_t,a_t)$ is used or learned by the agent in the process of learning an optimal policy for taking actions $a_t$. In actor-critic methods, the RL agent comprises a policy (actor) $\pi_\theta(a_t|o_t)$ and a value function (critic) $V_\phi(o_t)$, parameterized by $\theta$ and $\phi$, respectively.
Figure 1: Reinforcement Learning: RL consists of learning a policy $\pi_\theta(a_t | o_t)$, parameterized by $\theta$, that maximizes an objective function $\mathcal{J}$. Learning is based on repeated interactions with the environment: starting from a rule for choosing an action $a_t$ when in state $s_t$, which could be initially random, the RL agent uses a noisy version of this rule to explore the environment and collects an ensemble of trajectories $\tau\equiv\{a_t,o_t\}$ and rewards $r_t$, based on which the policy $\pi_\theta(a_t | o_t)$ is updated. Model-free RL describes a framework where the policy update does not depend on an explicit model $P(s_{t+1}|s_t,a_t)$ that represents the interactions of the agent with its environment. For simplicity, we show a policy updated every episode, but in most RL implementations, the update cycle is independent of the choice of episode.
Figure 1: Data generation during a typical model-free RL training episode. Physics determines how the true state $s_t$ of the body and the world evolves, and observations $o_t$ are the agent-observable elements of the physics. Actions $a_t$ are determined by the agent's policy $\pi_\theta$. The policy could retain a memory of previous observations and actions (not shown in this figure). The sequence of immediate rewards $r_t$, while assigned based on the resulting physical state in our model, can in general depend on the previous action taken and the current observations according to the agent's interpretation. Estimated infinite-horizon returns $\widehat{R}_t$ are then computed from the discounted future rewards plus the agent's expected return, estimated by the value function $V_T = V_\phi(o_T)$, once the stored rewards are exhausted. The objective of the learning agent is to maximize the expected value of the return over all possible trajectories.
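
The estimated return described in the last caption can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function name bootstrapped_returns, the discount factor gamma, and the indexing convention (stored rewards $r_0,\dots,r_{T-1}$, with $\widehat{R}_t = \sum_{k=t}^{T-1}\gamma^{k-t} r_k + \gamma^{T-t} V_\phi(o_T)$) are assumptions made for this example.

```python
import numpy as np


def bootstrapped_returns(rewards, bootstrap_value, gamma=0.99):
    """Estimate returns R_hat_t for t = 0..T-1 from stored rewards r_0..r_{T-1}.

    bootstrap_value stands in for V_phi(o_T), the critic's estimate of the
    return from the first step beyond the stored rewards (hypothetical input;
    pass 0.0 if the episode truly terminated at step T).
    """
    rewards = np.asarray(rewards, dtype=float)
    returns = np.empty_like(rewards)
    running = float(bootstrap_value)
    # Walk backwards in time: R_hat_t = r_t + gamma * R_hat_{t+1},
    # seeded with the critic's value once the stored rewards run out.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three stored rewards, critic estimate of 4.0 at the cutoff, gamma = 0.5.
example = bootstrapped_returns([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5)
print(example)  # [2. 2. 4.]
```

The backward recursion avoids recomputing the discounted sum for every time step, and replacing bootstrap_value with 0.0 recovers plain discounted returns for an episode that ends at the cutoff.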
