Reinforcement Learning with Action-Triggered Observations

Alexander Ryabchenko, Wenlong Mou

TL;DR

This work introduces Action-Triggered Sporadically Traceable MDPs (ATST-MDPs), a framework in which state observations occur only when actions trigger data-bursts, capturing practical constraints such as active sensing and costly feedback. It develops a Bellman framework on augmented states and defines an action-sequence value-function that summarizes rewards across bursts; under the linear MDP assumption, this value-function admits a linear representation via an induced feature map $\boldsymbol{\psi}$. The paper proves linearity results for the action-sequence value-function, provides off-policy estimation guarantees for the feature map, and introduces ST-LSVI-UCB, an algorithm achieving near-optimal regret $\widetilde{O}(\sqrt{K d^3 (1-\gamma)^{-3}})$ in episodic learning with geometric horizons, given accurate estimation of $\boldsymbol{\psi}$. The work lays theoretical foundations for information-constrained RL and shows that efficient learning remains feasible when observations are sporadic and action-triggered, with practical avenues via off-policy data and approximate optimization. The results inform both theory and potential real-world deployments with intermittent feedback and observation costs.

Abstract

We study reinforcement learning problems where state observations are stochastically triggered by actions, a constraint common in many real-world applications. This framework is formulated as Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), where each action has a specified probability of triggering a state observation. We derive tailored Bellman optimality equations for this framework and introduce the action-sequence learning paradigm in which agents commit to executing a sequence of actions until the next observation arrives. Under the linear MDP assumption, value-functions are shown to admit linear representations in an induced action-sequence feature map. Leveraging this structure, we propose off-policy estimators with statistical error guarantees for such feature maps and introduce ST-LSVI-UCB, a variant of LSVI-UCB adapted for action-triggered settings. ST-LSVI-UCB achieves regret $\widetilde{O}(\sqrt{K d^3 (1-\gamma)^{-3}})$, where $K$ is the number of episodes, $d$ the feature dimension, and $\gamma$ the discount factor (per-step episode non-termination probability). Crucially, this work establishes the theoretical foundation for learning with sporadic, action-triggered observations while demonstrating that efficient learning remains feasible under such observation constraints.
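
The interaction protocol described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's algorithm: the names `run_episode`, `toy_step`, and `toy_plan`, the 5-state ring environment, the trigger probabilities, and the ordering of reward, observation, and termination within a step are all hypothetical choices made for illustration. It shows the three ingredients named above: each action has its own probability of revealing the state, the agent commits to an action sequence until the next observation arrives, and the episode continues after each step with probability $\gamma$.

```python
import itertools
import random


def run_episode(step, trigger_prob, plan_from, init_state, gamma, rng):
    """Run one episode; return (total_reward, num_steps, num_observations)."""
    state = init_state                 # true state, hidden between observations
    plan = plan_from(init_state)       # assumption: the initial state is observed
    total, steps, n_obs = 0.0, 0, 0
    while True:
        action = next(plan)            # follow the committed sequence blindly
        state, reward = step(state, action)
        total += reward
        steps += 1
        # The action itself decides whether the new state is revealed.
        if rng.random() < trigger_prob[action]:
            n_obs += 1
            plan = plan_from(state)    # re-plan only when an observation arrives
        # Geometric horizon: continue with probability gamma after each step.
        if rng.random() >= gamma:
            return total, steps, n_obs


def toy_step(state, action):
    """Hypothetical 5-state ring: 'move' advances one state, 'sense' stays put."""
    if action == "move":
        state = (state + 1) % 5
        return state, 1.0 if state == 4 else 0.0
    return state, 0.0


def toy_plan(observed_state):
    """Move until state 4 should be reached, then sense until observed."""
    n_moves = (4 - observed_state) % 5 or 5
    return itertools.chain(["move"] * n_moves, itertools.repeat("sense"))


rng = random.Random(0)
trigger_prob = {"move": 0.1, "sense": 0.9}   # illustrative values only
results = [run_episode(toy_step, trigger_prob, toy_plan, 0, 0.95, rng)
           for _ in range(1000)]
mean_steps = sum(r[1] for r in results) / len(results)
# With continuation probability gamma, the expected length is 1 / (1 - gamma).
print(f"mean episode length: {mean_steps:.1f}")
```

Setting every trigger probability to 1 recovers a fully observed process in which the agent re-plans at every step, while setting them all to 0 forces the agent to execute its initial sequence for the whole episode.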

Paper Structure

This paper contains 41 sections, 37 theorems, 100 equations, 1 figure, 1 algorithm.

Key Result

Theorem 2.5

Under augmented policy $\pi:\mathcal{X}\to\mathcal{A}$, the action value-function satisfies:

Figures (1)

  • Figure 1: Execution protocol of the ATST-MDP over $K$ episodes with geometric horizons.

Theorems & Definitions (68)

  • Example 2.2: Faulty communication channel
  • Example 2.3: Paid observations
  • Example 2.4: Reset-to-observe
  • Theorem 2.5
  • Definition 2.6
  • Lemma 3.1: Linearity of belief
  • Theorem 3.2: Linearity of $K^{\pi}$
  • Definition 3.3
  • Theorem 3.4
  • Lemma 3.5
  • ...and 58 more