When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift

Kevin Vogt-Lowell, Theodoros Tsiligkaridis, Rodney Lafuente-Mercado, Surabhi Ghatti, Shanghua Gao, Marinka Zitnik, Daniela Rus

TL;DR

Under a stochastic sensor failure process, we prove a high-probability bound on infinite-horizon reward degradation that quantifies how robustness depends on policy smoothness and failure persistence.

Abstract

Real-world reinforcement learning systems must operate under distributional drift in their observation streams, yet most policy architectures implicitly assume fully observed and noise-free states. We study robustness of Proximal Policy Optimization (PPO) under temporally persistent sensor failures that induce partial observability and representation shift. To respond to this drift, we augment PPO with temporal sequence models, including Transformers and State Space Models (SSMs), to enable policies to infer missing information from history and maintain performance. Under a stochastic sensor failure process, we prove a high-probability bound on infinite-horizon reward degradation that quantifies how robustness depends on policy smoothness and failure persistence. Empirically, on MuJoCo continuous-control benchmarks with severe sensor dropout, we show Transformer-based sequence policies substantially outperform MLP, RNN, and SSM baselines in robustness, maintaining high returns even when large fractions of sensors are unavailable. These results demonstrate that temporal sequence reasoning provides a principled and practical mechanism for reliable operation under observation drift caused by sensor unreliability.
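
To make the notion of temporally persistent sensor failure concrete, the following is a minimal sketch of one possible failure process: each sensor follows an independent two-state Markov chain (working or failed), and failed readings are replaced by a fill value. The function names, the parameters `p_fail` and `p_recover`, and the zero-fill convention are illustrative assumptions, not the failure model specified in the paper.

```python
import numpy as np


def simulate_sensor_mask(n_steps, n_sensors, p_fail, p_recover, rng):
    """Return a (n_steps, n_sensors) boolean array; True means the sensor works.

    Assumption (not from the paper): each sensor is an independent two-state
    Markov chain. A working sensor fails with probability p_fail per step and
    a failed sensor recovers with probability p_recover per step. A small
    p_recover makes failures persist over many steps.
    """
    mask = np.ones((n_steps, n_sensors), dtype=bool)
    working = np.ones(n_sensors, dtype=bool)  # all sensors start working
    for t in range(n_steps):
        u = rng.random(n_sensors)  # uniform draws in [0, 1)
        # Working sensors stay up when u >= p_fail; failed ones recover when u < p_recover.
        working = np.where(working, u >= p_fail, u < p_recover)
        mask[t] = working
    return mask


def apply_mask(obs, mask, fill_value=0.0):
    """Replace readings of failed sensors with fill_value (hypothetical convention)."""
    return np.where(mask, obs, fill_value)


rng = np.random.default_rng(0)
mask = simulate_sensor_mask(1000, 17, p_fail=0.05, p_recover=0.05, rng=rng)
obs = rng.normal(size=(1000, 17))  # stand-in for environment observations
masked_obs = apply_mask(obs, mask)
```

Under this sketch, the long-run fraction of failed sensors is `p_fail / (p_fail + p_recover)`, and the expected length of a failure episode is `1 / p_recover` steps, so the two parameters separate how often sensors are missing from how long each outage lasts. A sequence policy would then receive a window of `masked_obs` rows (optionally with the mask itself) rather than a single observation.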

Paper Structure

This paper contains 30 sections, 3 theorems, 33 equations, 3 figures, and 2 tables.

Key Result

Theorem 5.6

Assume Assumptions 1 (boundedness) through 5 (independence) hold. Fix $\delta\in(0,1)$. Then, with probability at least $1-\delta$, … Moreover, the mean satisfies …

Figures (3)

  • Figure 1: Sample PPO training curves on HalfCheetah-v4 under full (left) and 60% partial (right) observability. Lines represent median episodic return and shaded regions denote inter-quartile ranges across 8 random seeds. Training curves generated under partial observability rise more slowly and plateau at lower returns than those produced using fully observed states.
  • Figure 2: Evaluation episodic returns for PPO agents on MuJoCo environments under full (left) and 60% partial (right) observability, with task complexity roughly increasing from top to bottom. Each violin shows the distribution of pooled episodic returns from 100 episodes across 8 random seeds. Black markers denote the median with 95% bootstrapped CI. While all models suffer performance degradation under partial observability, the Transformer agent demonstrates greater robustness.
  • Figure 3: PPO training curves for Hopper-v4, Walker2d-v4, and Ant-v4 under full (left) and 60% partial (right) observability. Lines represent median episodic return and shaded regions denote inter-quartile ranges across 8 random seeds.
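
The Figure 2 caption reports a median with a 95% bootstrapped confidence interval over pooled episodic returns. A minimal sketch of how such an interval could be computed is given below; it assumes a percentile bootstrap, and the function name, `n_boot`, and the synthetic returns are illustrative rather than taken from the paper.

```python
import numpy as np


def bootstrap_median_ci(returns, n_boot=2000, alpha=0.05, rng=None):
    """Return (median, lower, upper) using a percentile bootstrap.

    Assumption: the percentile method is used here for illustration; the
    caption does not say which bootstrap variant the authors applied.
    """
    rng = np.random.default_rng() if rng is None else rng
    returns = np.asarray(returns, dtype=float)
    n = len(returns)
    # Resample the pooled returns with replacement, n_boot times.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_medians = np.median(returns[idx], axis=1)
    lower, upper = np.quantile(boot_medians, [alpha / 2, 1 - alpha / 2])
    return float(np.median(returns)), float(lower), float(upper)


# Synthetic stand-in for 100 episodes x 8 seeds of pooled episodic returns.
rng = np.random.default_rng(0)
pooled_returns = rng.normal(loc=3000.0, scale=400.0, size=800)
median, lower, upper = bootstrap_median_ci(pooled_returns, rng=rng)
```

Pooling episodes across seeds, as the caption describes, treats all 800 returns as one sample; resampling seeds first and episodes second would be a more conservative alternative when seed-to-seed variation dominates.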

Theorems & Definitions (5)

  • Theorem 5.6: High-probability reward-degradation bound
  • Lemma A.1: Pointwise Wasserstein bound on the per-step loss
  • Lemma A.2
  • Proof: Proof of Theorem 5.6
  • Remark A.3: Signed version