When to Sense and Control? A Time-adaptive Approach for Continuous-Time RL

Lenart Treven, Bhavya Sukhija, Yarden As, Florian Dörfler, Andreas Krause

TL;DR

This work formalizes Time-adaptive Control & Sensing (TaCoS), an RL framework for continuous-time systems in which each interaction (a measurement or a switch of action) is costly. TaCoS optimizes over policies that, in addition to the control, predict how long that control should be applied. The result is an extended MDP that any standard RL algorithm can solve.

Abstract

Reinforcement learning (RL) excels in optimizing policies for discrete-time Markov decision processes (MDPs). However, many systems are inherently continuous in time, making discrete-time MDPs an inexact modeling choice. In many applications, such as greenhouse control or medical treatments, each interaction (measurement or switching of action) involves manual intervention and thus is inherently costly. Therefore, we generally prefer a time-adaptive approach with fewer interactions with the system. In this work, we formalize an RL framework, Time-adaptive Control & Sensing (TaCoS), that tackles this challenge by optimizing over policies that, in addition to the control, predict the duration of its application. Our formulation results in an extended MDP that any standard RL algorithm can solve. We demonstrate that state-of-the-art RL algorithms trained on TaCoS drastically reduce the number of interactions compared with their discrete-time counterparts while retaining the same or improved performance and exhibiting robustness to the discretization frequency. Finally, we propose OTaCoS, an efficient model-based algorithm for our setting. We show that OTaCoS enjoys sublinear regret for systems with sufficiently smooth dynamics and empirically results in further sample-efficiency gains.
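
The extended-MDP idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy pendulum dynamics, the running cost, the class name `TimeAdaptivePendulum`, the helper `rollout`, and all parameter values (`horizon`, `t_min`, `t_max`, `interaction_cost`, `dt`) are assumptions chosen for illustration. The only point is that an action becomes a (control, duration) pair, the control is held for the chosen duration, and each interaction is charged a fixed cost.

```python
import math
from dataclasses import dataclass


@dataclass
class TimeAdaptivePendulum:
    """Toy extended MDP (hypothetical): an action is a (control, duration) pair."""
    horizon: float = 10.0          # episode length in seconds (assumed)
    t_min: float = 0.05            # shortest allowed hold time (assumed)
    t_max: float = 1.0             # longest allowed hold time (assumed)
    interaction_cost: float = 0.1  # price of one measurement/switch (assumed)
    dt: float = 0.01               # target Euler step for the ODE

    def reset(self):
        # Start hanging down (theta = pi); theta = 0 is upright.
        self.theta, self.omega, self.time = math.pi, 0.0, 0.0
        return self._obs()

    def _obs(self):
        # The extended state also carries the time left in the episode.
        return (self.theta, self.omega, self.horizon - self.time)

    def step(self, control, duration):
        # Clip the requested hold time, then cap it at the remaining time.
        duration = min(max(duration, self.t_min), self.t_max)
        duration = min(duration, self.horizon - self.time)
        n_steps = max(1, round(duration / self.dt))
        h = duration / n_steps
        reward = -self.interaction_cost  # charged once per interaction
        for _ in range(n_steps):
            # Hold `control` constant while integrating the dynamics.
            cost = (1.0 - math.cos(self.theta)) + 0.1 * self.omega**2 + 0.001 * control**2
            reward -= cost * h
            self.omega += (9.81 * math.sin(self.theta) + control) * h
            self.theta += self.omega * h
        self.time += duration
        done = self.time >= self.horizon - 1e-9
        return self._obs(), reward, done


def rollout(env, policy):
    """Run one episode; return (total reward, number of interactions)."""
    obs, done, total, n = env.reset(), False, 0.0, 0
    while not done:
        control, duration = policy(obs)
        obs, reward, done = env.step(control, duration)
        total += reward
        n += 1
    return total, n


# Two fixed policies that differ only in how long they hold the control.
ret_long, n_long = rollout(TimeAdaptivePendulum(), lambda obs: (0.0, 1.0))
ret_short, n_short = rollout(TimeAdaptivePendulum(), lambda obs: (0.0, 0.05))
print(n_long, n_short, ret_long > ret_short)
```

Because `step` exposes the usual observation, reward, and termination signal, a standard RL algorithm could in principle be trained on such a wrapper with the duration treated as one more action dimension. In this sketch the two policies apply the same zero control, so the gap between their returns comes only from the per-interaction cost.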

Paper Structure (25 sections, 18 theorems, 56 equations, 7 figures)

Key Result

Proposition 1

The problems in Eq. (transition cost setting) and Eq. (bounded cost setting) are equivalent to Eq. (transition cost, reformulated) and Eq. (bounded switches, reformulated), respectively.

Figures (7)

  • Figure 1: Experiment on the Pendulum environment for the average cost and a bounded number of switches setting.
  • Figure 2: We study the effects of the bound on interactions $K$ on the performance of the agent. TaCoS performs significantly better than equidistant discretization, especially for small values of $K$.
  • Figure 3: Effect of interaction cost (first row) and environment stochasticity (second row) on the number of interactions and episode reward for the Pendulum and Greenhouse tasks.
  • Figure 4: We compare the performance of TaCoS in combination with SAC and PPO with the standard SAC algorithm and SAC with more compute (SAC-MC) over a range of values for $t_{\min}$ (first row). In the second row, we plot the episode reward versus the physical time in seconds spent in the environment for SAC-TaCoS, SAC, and SAC-MC for a specific evaluation frequency $1/t_{\text{eval}}$. We exclude PPO-TaCoS in this plot as it, being on-policy, requires significantly more samples than the off-policy methods. While all methods perform equally well for standard discretization (denoted with $1/t^*$), our method is robust to interaction frequency and does not suffer a performance drop when we decrease $t_{\min}$.
  • Figure 5: We run OTaCoS on the Pendulum and RC car environments. We report the achieved reward averaged over five different seeds with one standard error.
  • ...and 2 more figures

Theorems & Definitions (36)

  • Proposition 1
  • Definition 1: Well-calibrated statistical model of $\bm{\Phi}^*$ (Rothfuss et al., 2023)
  • Theorem 2
  • Definition 2: Model Complexity
  • Definition 3
  • Lemma 3: Difference lemma
  • proof
  • Lemma 4: Per episode regret bound
  • proof
  • Lemma 5: Objective upper bound
  • ...and 26 more