Reinforcement Learning with Action-Triggered Observations
Alexander Ryabchenko, Wenlong Mou
TL;DR
This work introduces Action-Triggered Sporadically Traceable MDPs (ATST-MDPs), a framework in which state observations occur only when actions trigger data-bursts, capturing practical constraints such as active sensing and costly feedback. It develops a Bellman framework on augmented states and defines an action-sequence value-function that summarizes rewards across bursts; under the linear MDP assumption, this value-function is linear in an induced feature map $\boldsymbol{\psi}$. The paper proves these linearity results, provides off-policy estimation guarantees for the feature map, and introduces ST-LSVI-UCB, an algorithm achieving near-optimal regret $\widetilde{O}(\sqrt{K d^3 (1-\gamma)^{-3}})$ in episodic learning with geometric horizons, given accurate estimation of $\boldsymbol{\psi}$. Together, these results lay a theoretical foundation for information-constrained RL and show that efficient learning remains feasible under sporadic, action-triggered observations, with practical avenues via off-policy data and approximate optimization and potential relevance to real-world deployments with intermittent feedback and observation costs.
Abstract
We study reinforcement learning problems where state observations are stochastically triggered by actions, a constraint common in many real-world applications. This framework is formulated as Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), where each action has a specified probability of triggering a state observation. We derive tailored Bellman optimality equations for this framework and introduce the action-sequence learning paradigm in which agents commit to executing a sequence of actions until the next observation arrives. Under the linear MDP assumption, value-functions are shown to admit linear representations in an induced action-sequence feature map. Leveraging this structure, we propose off-policy estimators with statistical error guarantees for such feature maps and introduce ST-LSVI-UCB, a variant of LSVI-UCB adapted for action-triggered settings. ST-LSVI-UCB achieves regret $\widetilde O(\sqrt{Kd^3(1-\gamma)^{-3}})$, where $K$ is the number of episodes, $d$ the feature dimension, and $\gamma$ the discount factor (per-step episode non-termination probability). Crucially, this work establishes the theoretical foundation for learning with sporadic, action-triggered observations while demonstrating that efficient learning remains feasible under such observation constraints.
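To make the interaction protocol described in the abstract concrete, the following is a minimal toy simulation, not the paper's algorithm or environment. It illustrates three ingredients named above: each action triggers an observation with its own probability, the agent keeps executing a committed action sequence until an observation arrives, and the episode continues after each step with probability $\gamma$ (a geometric horizon with mean $1/(1-\gamma)$). The action names, trigger probabilities, toy dynamics, the choice to cycle through the committed sequence, and the assumption that the initial state is observed are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy setup: action names and probabilities are illustrative only.
TRIGGER_PROB = {"move": 0.1, "probe": 0.9}  # P(observation is triggered | action)


def step(state, action):
    """Toy deterministic dynamics: 'move' advances the hidden state, 'probe' leaves it."""
    return state + 1 if action == "move" else state


def run_episode(gamma, sequence, rng):
    """Simulate one episode with action-triggered observations.

    gamma:    per-step non-termination probability (geometric horizon).
    sequence: committed action sequence, cycled until an observation arrives
              (cycling is an assumption of this sketch).
    Returns (steps_taken, observations), where observations is a list of
    (step_index, state) pairs the agent actually got to see.
    """
    state, t = 0, 0
    observations = [(0, state)]  # assumption: the initial state is observed
    i = 0  # position within the committed sequence
    while True:
        action = sequence[i % len(sequence)]
        state = step(state, action)
        t += 1
        if rng.random() < TRIGGER_PROB[action]:
            observations.append((t, state))
            i = 0  # observation arrived: restart the committed sequence
        else:
            i += 1  # no observation: keep executing blindly
        if rng.random() >= gamma:  # terminate with probability 1 - gamma
            return t, observations


if __name__ == "__main__":
    rng = random.Random(0)
    lengths = [run_episode(0.9, ["move", "move", "probe"], rng)[0] for _ in range(10000)]
    print("average episode length:", sum(lengths) / len(lengths))
```

In this sketch the agent sees the state only at the recorded observation times, so any decision rule must map the last observed state to a whole sequence of actions rather than to a single action, which is the motivation for the action-sequence viewpoint.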
