When to Sense and Control? A Time-adaptive Approach for Continuous-Time RL

Lenart Treven; Bhavya Sukhija; Yarden As; Florian Dörfler; Andreas Krause

TL;DR

This work formalizes an RL framework, Time-adaptive Control & Sensing (TaCoS), for continuous-time systems in which each interaction (a measurement or a change of action) is costly. TaCoS optimizes over policies that, besides the control, also predict how long to apply it, which yields an extended MDP that any standard RL algorithm can solve.

Abstract

Reinforcement learning (RL) excels in optimizing policies for discrete-time Markov decision processes (MDPs). However, various systems are inherently continuous in time, making discrete-time MDPs an inexact modeling choice. In many applications, such as greenhouse control or medical treatments, each interaction (measurement or switching of action) involves manual intervention and thus is inherently costly. Therefore, we generally prefer a time-adaptive approach with fewer interactions with the system. In this work, we formalize an RL framework, Time-adaptive Control & Sensing (TaCoS), that tackles this challenge by optimizing over policies that, besides the control, also predict the duration of its application. Our formulation results in an extended MDP that any standard RL algorithm can solve. We demonstrate that state-of-the-art RL algorithms trained on TaCoS drastically reduce the number of interactions compared to their discrete-time counterparts while retaining the same or improved performance and exhibiting robustness to the discretization frequency. Finally, we propose OTaCoS, an efficient model-based algorithm for our setting. We show that OTaCoS enjoys sublinear regret for systems with sufficiently smooth dynamics and empirically results in further sample-efficiency gains.
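
The idea of a policy that outputs both a control and the duration of its application can be illustrated with a small sketch. The code below is not the authors' implementation: the class name `TimeAdaptiveEnv`, the helper `rollout`, the scalar dynamics $\dot{x} = -x + u$, the quadratic running cost, and all numeric values are assumptions chosen for illustration. It shows one plausible way an extended MDP could be set up, where each step integrates the system under a held control, charges a fixed interaction cost, and returns an observation augmented with the remaining time.

```python
from dataclasses import dataclass


@dataclass
class TimeAdaptiveEnv:
    """Toy extended MDP: each action is a pair (control u, hold duration)."""

    horizon: float = 2.0            # episode length T (illustrative value)
    t_min: float = 0.1              # shortest allowed hold time
    t_max: float = 1.0              # longest allowed hold time
    interaction_cost: float = 0.05  # fixed cost charged per interaction
    dt: float = 1e-3                # approximate inner Euler step

    def reset(self, x0: float = 1.0):
        self.x, self.t = x0, 0.0
        # The observation is augmented with the time left in the episode.
        return (self.x, self.horizon - self.t)

    def step(self, u: float, duration: float):
        # Clip the requested duration to [t_min, t_max] and to the time left.
        tau = min(max(duration, self.t_min), self.t_max, self.horizon - self.t)
        n_steps = max(1, round(tau / self.dt))
        h = tau / n_steps
        reward = 0.0
        for _ in range(n_steps):
            # Assumed toy dynamics dx/dt = -x + u with a quadratic running cost.
            reward -= (self.x ** 2 + 0.1 * u ** 2) * h
            self.x += (-self.x + u) * h
        self.t += tau
        reward -= self.interaction_cost  # charged once per interaction
        done = self.horizon - self.t <= 1e-9
        return (self.x, self.horizon - self.t), reward, done


def rollout(env, policy, x0=1.0):
    """Run one episode; return (total reward, number of interactions)."""
    obs, total, n, done = env.reset(x0), 0.0, 0, False
    while not done:
        u, duration = policy(obs)
        obs, r, done = env.step(u, duration)
        total += r
        n += 1
    return total, n
```

For example, `rollout(TimeAdaptiveEnv(), lambda obs: (0.0, 0.5))` performs four interactions over the two-second horizon. Because `step` has the usual discrete-time signature, a standard RL algorithm could in principle be trained on such an environment with the duration treated as one more action dimension.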

Paper Structure (25 sections, 18 theorems, 56 equations, 7 figures)

Introduction
Contributions
Problem statement
Interaction cost
Bounded number of interactions
TaCoS: Time Adaptive Control or Sensing
Reformulation of Interaction Cost setting to Discrete-time MDPs
Reformulation of Bounded Number of Interactions to Discrete-time MDPs
TaCoS with Model-free RL Algorithms
How does the bound on the number of interactions $K$ affect TaCoS?
How does the interaction cost magnitude influence TaCoS?
How does environment stochasticity influence the number of interactions?
How does $t_{\min}$ influence TaCoS?
Efficient Exploration for TaCoS via Model-Based RL
Related Work
...and 10 more sections

Key Result

Proposition 1

The problems in the interaction cost setting and the bounded number of interactions setting are equivalent to their reformulated discrete-time MDP counterparts, respectively.

Figures (7)

Figure 1: Experiment on the Pendulum environment for the average cost and a bounded number of switches setting.
Figure 2: We study the effects of the bound on interactions $K$ on the performance of the agent. TaCoS performs significantly better than equidistant discretization, especially for small values of $K$.
Figure 3: Effect of interaction cost (first row) and environment stochasticity (second row) on the number of interactions and episode reward for the Pendulum and Greenhouse tasks.
Figure 4: We compare the performance of TaCoS in combination with SAC and PPO with the standard SAC algorithm and SAC with more compute (SAC-MC) over a range of values for $t_{\min}$ (first row). In the second row, we plot the episode reward versus the physical time in seconds spent in the environment for SAC-TaCoS, SAC, and SAC-MC for a specific evaluation frequency $1/t_{\text{eval}}$. We exclude PPO-TaCoS in this plot as it, being on-policy, requires significantly more samples than the off-policy methods. While all methods perform equally well for standard discretization (denoted with $1/t^*$), our method is robust to interaction frequency and does not suffer a performance drop when we decrease $t_{\min}$.
Figure 5: We run OTaCoS on the Pendulum and RC car environments. We report the achieved reward averaged over five different seeds with one standard error.
...and 2 more figures

Theorems & Definitions (36)

Proposition 1
Definition 1: Well-calibrated statistical model of $\bm{\Phi}^*$ (Rothfuss et al., 2023)
Theorem 2
Definition 2: Model Complexity
Definition 3
Lemma 3: Difference lemma
proof
Lemma 4: Per episode regret bound
proof
Lemma 5: Objective upper bound
...and 26 more
