Thinking While Moving: Deep Reinforcement Learning with Concurrent Control

Ted Xiao, Eric Jang, Dmitry Kalashnikov, Sergey Levine, Julian Ibarz, Karol Hausman, Alexander Herzog

TL;DR

The paper addresses reinforcement learning in concurrent environments where action selection occurs during ongoing dynamics, formalizing a continuous-time Bellman framework and a latency-aware discretization. It shows that augmenting Q-learning with minimal concurrency information, particularly previous actions and action-selection latency, preserves contraction and convergence, and introduces Vector-to-go (VTG) as a robust representation. Through toy control tasks and large-scale robotic grasping experiments, including real-robot results, the approach achieves faster, smoother policies with comparable task success to blocking baselines. The work demonstrates practical gains in speed and motion quality for real-time robotic control and lays groundwork for future extensions to other RL methods and latency regimes.

Abstract

We study reinforcement learning in settings where sampling an action from the policy must be done concurrently with the time evolution of the controlled system, such as when a robot must decide on the next action while still performing the previous action. Much like a person or an animal, the robot must think and move at the same time, deciding on its next action before the previous one has completed. In order to develop an algorithmic framework for such concurrent control problems, we start with a continuous-time formulation of the Bellman equations, and then discretize them in a way that is aware of system delays. We instantiate this new class of approximate dynamic programming methods via a simple architectural extension to existing value-based deep reinforcement learning algorithms. We evaluate our methods on simulated benchmark tasks and a large-scale robotic grasping task where the robot must "think while moving".

Paper Structure

This paper contains 30 sections, 3 theorems, 17 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Lemma 3.1

The concurrent continuous-time Bellman operator is a contraction.
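For readers unfamiliar with the term, the contraction property can be stated generically as below. The operator symbol $\mathcal{T}$, the constant $c$, and the unspecified norm are illustrative assumptions; the paper's own operator and norm are not reproduced in this summary.

$$
\|\mathcal{T}Q_1 - \mathcal{T}Q_2\| \le c \, \|Q_1 - Q_2\| \quad \text{for all } Q_1, Q_2, \qquad 0 \le c < 1
$$

By the Banach fixed-point theorem, an operator with this property on a complete metric space has a unique fixed point, and repeated application of the operator converges to that fixed point from any starting point.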

Figures (7)

  • Figure 1: Shaded nodes represent observed variables and unshaded nodes represent unobserved random variables. (a): In "blocking" MDPs, the environment state does not change while the agent records the current state and selects an action. (b): In "concurrent" MDPs, state and action dynamics are continuous-time stochastic processes $s(t)$ and $a_i(t)$. At time $t$, the agent observes the state of the world $s(t)$, but by the time it selects an action $a_i(t+t_{AS})$, the previous continuous-time action function $a_{i-1}(t-H+t_{AS''})$ has "rolled over" to an unobserved state $s(t+t_{AS})$. An agent that concurrently selects actions from old states while in motion may need to interrupt a previous action before it has finished executing its current trajectory.
  • Figure 2: In concurrent versions of Cartpole and Pendulum, we observe that providing the critic with VTG leads to more robust performance across all hyperparameters. (a) Environment rewards achieved by DQN with different network architectures [either a feedforward network (FNN) or a Long Short-Term Memory (LSTM) network] and different concurrent knowledge features [Unconditioned, Vector-to-go (VTG), or previous action and $t_{AS}$] on the concurrent Cartpole task for every hyperparameter in a sweep, sorted in decreasing order. (b) Environment rewards achieved by DQN with an FNN and different frame-stacking and concurrent knowledge parameters on the concurrent Pendulum task for every hyperparameter in a sweep, sorted in decreasing order. Larger area-under-curve implies more robustness to hyperparameter choices. Enlarged figures are provided in the appendix.
  • Figure 3: An overview of the robotic grasping task. A static manipulator arm attempts to grasp objects placed in bins in front of it. In simulation, the objects are procedurally generated.
  • Figure 4: The execution order of the different stages is shown relative to the sampling period $H$ as well as the latency $t_{AS}$. (a): In "blocking" environments, state capture and policy inference are assumed to be instantaneous. (b): In "concurrent" environments, state capture and policy inference are assumed to proceed concurrently with action execution.
  • Figure 5: Concurrent knowledge representations can be visualized through an example of a 2-D pointmass discrete-time toy task. Vector-to-go represents the remaining action that may be executed when the current state $s_t$ is observed. Previous action represents the full commanded action from the previous timestep.
  • ...and 2 more figures
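
The two concurrent knowledge representations described in the Figure 5 caption can be illustrated with a minimal sketch for a 2-D pointmass. This is not the authors' implementation: it assumes that an action is a commanded displacement, so that the vector-to-go is the previous command minus the displacement already executed when the state is observed. The function names `vector_to_go` and `augment_state` and all numeric values are hypothetical.

```python
import numpy as np


def vector_to_go(prev_action, pos_at_command, pos_now):
    """Remaining part of the previous displacement command.

    Assumption (not from the paper): an action is a commanded 2-D
    displacement, so the executed part is the displacement covered
    since the command was issued.
    """
    prev_action = np.asarray(prev_action, dtype=float)
    executed = np.asarray(pos_now, dtype=float) - np.asarray(pos_at_command, dtype=float)
    return prev_action - executed


def augment_state(state, concurrent_feature):
    """Concatenate the observed state with a concurrent-knowledge feature."""
    return np.concatenate([
        np.asarray(state, dtype=float),
        np.asarray(concurrent_feature, dtype=float),
    ])


# Hypothetical values: the previous command was "move by (1, 2)", and the
# pointmass has covered a quarter of it when the new state is observed.
prev_action = np.array([1.0, 2.0])
pos_at_command = np.array([0.0, 0.0])
pos_now = np.array([0.25, 0.5])

vtg = vector_to_go(prev_action, pos_at_command, pos_now)

# Two alternative inputs for a value function: state plus VTG, or state
# plus the full previous action.
q_input_vtg = augment_state(pos_now, vtg)
q_input_prev = augment_state(pos_now, prev_action)
```

In this sketch the previous-action feature stays the same no matter how much of the command has been executed, whereas the vector-to-go shrinks toward zero as the pointmass approaches the commanded target.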

Theorems & Definitions (8)

  • Lemma 3.1
  • proof
  • Lemma 3.2
  • proof
  • Lemma A.1
  • proof
  • proof
  • proof