Towards Batch-to-Streaming Deep Reinforcement Learning for Continuous Control

Riccardo De Monte, Matteo Cederle, Gian Antonio Susto

TL;DR

This work proposes two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device finetuning applications such as Sim2Real transfer.

Abstract

State-of-the-art deep reinforcement learning (RL) methods have achieved remarkable performance in continuous control tasks, yet their computational complexity is often incompatible with the constraints of resource-limited hardware, due to their reliance on replay buffers, batch updates, and target networks. The emerging paradigm of streaming deep RL addresses this limitation through purely online updates, achieving strong empirical performance on standard benchmarks. In this work, we propose two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device finetuning applications such as Sim2Real transfer. Both algorithms achieve performance comparable to state-of-the-art streaming baselines on standard benchmarks without requiring tedious hyperparameter tuning. Finally, we investigate the practical challenges of transitioning from batch to streaming learning during finetuning and propose concrete strategies to tackle them.

Paper Structure

This paper contains 22 sections, 6 equations, 15 figures, 8 tables, and 2 algorithms.

Figures (15)

  • Figure 1: Results for streaming DRL algorithms SDAC, S2AC, and Stream AC$(\lambda)$ on MuJoCo Gym and DM Control Suite tasks.
  • Figure 2: Ablation study for SDAC and S2AC.
  • Figure 3: Results for the batch methods on MuJoCo Gym and DM Control Suite tasks. TD3-norm and SAC-norm denote the versions of TD3 and SAC with data normalization and the same network architectures used for the streaming approaches.
  • Figure 4: Finetuning performance of SDAC after pre-training with TD3-norm using Adam as the critic optimizer. For each environment, we report three different intermediate pre-training checkpoints across different seeds. Moreover, for each of them we averaged the results across three seeds of finetuning. The horizontal dashed lines represent the agent performance before finetuning.
  • Figure 5: $L^2$-norm of the critic's network weights across 1M steps of training.
  • ...and 10 more figures