Adversarial Latent-State Training for Robust Policies in Partially Observable Domains

Angad Singh Ahuja

TL;DR

This work formalizes a focused setting in which an adversary selects a hidden initial latent distribution before the episode, termed an adversarial latent-initial-state POMDP. It proves a latent minimax principle, characterizes worst-case defender distributions, and derives approximate best-response inequalities with finite-sample concentration bounds that make the optimization and sampling terms explicit.

Abstract

Robustness under latent distribution shift remains challenging in partially observable reinforcement learning. We formalize a focused setting where an adversary selects a hidden initial latent distribution before the episode, termed an adversarial latent-initial-state POMDP. Theoretically, we prove a latent minimax principle, characterize worst-case defender distributions, and derive approximate best-response inequalities with finite-sample concentration bounds that make the optimization and sampling terms explicit. Empirically, using a Battleship benchmark, we demonstrate that targeted exposure to shifted latent distributions reduces average robustness gaps between Spread and Uniform distributions from 10.3 to 3.1 shots at equal budget. Furthermore, iterative best-response training exhibits budget-sensitive behavior that is qualitatively consistent with the theorem-guided diagnostics once one accounts for discounted PPO surrogates and finite-sample noise. Ultimately, we show that for latent-initial-state problems, the framework yields a clean evaluation game and useful theorem-motivated diagnostics while also making clear where implementation-level surrogates and optimization limits enter.
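The idea behind characterizing worst-case defender distributions can be illustrated on a toy finite game. The sketch below is not the paper's construction: it assumes a small hypothetical matrix `shots` whose entry `[i, j]` is the expected shots-to-win of attacker policy `i` against latent layout `j`, and all numbers are invented for illustration. For a fixed attacker mixture the expected cost is linear in the defender's distribution, so a maximum over the probability simplex is attained at a single layout.

```python
import numpy as np

def worst_case_defender(policy_mix, shots):
    """Worst-case latent distribution against a fixed attacker mixture.

    The expected cost policy_mix @ shots @ defender_dist is linear in
    defender_dist, so a maximizer over the simplex sits at a vertex.
    """
    per_layout = policy_mix @ shots          # expected shots per layout
    worst = int(np.argmax(per_layout))       # index of the worst layout
    dist = np.zeros(shots.shape[1])
    dist[worst] = 1.0                        # point mass on that layout
    return dist, float(per_layout[worst])

# Hypothetical payoff table: 2 attacker policies x 3 latent layouts.
shots = np.array([[40.0, 52.0, 47.0],
                  [45.0, 44.0, 50.0]])
policy_mix = np.array([0.5, 0.5])            # hypothetical attacker mixture

dist, value = worst_case_defender(policy_mix, shots)
print(dist, value)

# Sanity check: no randomly drawn defender distribution does worse.
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(3), size=1000)
max_sampled = float((samples @ (policy_mix @ shots)).max())
```

In this toy table the sampled maximum never exceeds `value`, which is the finite-matrix analogue of an extreme-point defender; the paper's statement concerns history-dependent attacker policies in a POMDP rather than a fixed matrix.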

Paper Structure

This paper contains 19 sections, 7 theorems, 102 equations, 1 figure, 6 tables, and 2 algorithms.

Key Result

Lemma 1

Suppose the reward is [equation omitted]. Then for any policy $\pi$, [equation omitted]. Therefore, maximizing undiscounted expected return is exactly equivalent to minimizing expected shots-to-win.
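
The displayed formulas of Lemma 1 are not reproduced on this page. One reading consistent with the lemma's title and conclusion is sketched here. It assumes a reward of $-1$ per shot and an almost surely finite number of shots $T$. These assumptions, and the symbols $r_t$, $T$, $J$, and $\gamma$, are illustrative and not taken from the source.

$$
r_t = -1 \quad (t = 1, \dots, T), \qquad
J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=1}^{T} r_t\right] = -\,\mathbb{E}_\pi[T].
$$

Under this reading, $\arg\max_\pi J(\pi) = \arg\min_\pi \mathbb{E}_\pi[T]$. With a discount factor $\gamma \in (0,1)$ the return becomes $-\sum_{t=1}^{T} \gamma^{t-1} = -\frac{1-\gamma^{T}}{1-\gamma}$, which is monotone but not linear in $T$, so the exact equivalence need not survive in expectation. This may be the gap that Remark 1 on the discounted PPO surrogate addresses.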

Figures (1)

  • Figure 1: Training dynamics and IBR diagnostics averaged over independent seeds. (a) Fixed-mixture and alternating stress approaches readily acquire nominal proficiency without catastrophic gameplay degradation. (b) Direct targeted adversarial exposure effectively limits extreme worst-case trajectory vulnerabilities, explicitly lowering tail severity ($\mathrm{CVaR}_{0.10}$ and $p95$). (c) In Stage 2, tracked IBR metrics show that sufficient optimization bandwidth can yield positive defender_adversarial shifts and subsequent attacker_adaptation; these curves should be read as theorem-motivated diagnostics rather than as exact row-wise certificates.
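
The tail metrics named in panel (b) can be computed from per-episode shot counts roughly as follows. This is a minimal sketch under an assumed convention: higher shot counts are worse, and $\mathrm{CVaR}_{0.10}$ is taken to be the mean of the worst 10% of episodes. The paper may define the tail differently, and the function name `tail_metrics` and the sample data are hypothetical.

```python
import numpy as np

def tail_metrics(shots, alpha=0.10):
    """Return (CVaR_alpha, p95) for per-episode shots-to-win.

    Assumed convention: larger shot counts are worse, so the tail is
    the top alpha fraction of the sorted sample.
    """
    shots = np.sort(np.asarray(shots, dtype=float))
    k = max(1, int(np.ceil(alpha * shots.size)))  # number of tail episodes
    cvar = float(shots[-k:].mean())               # mean of the worst k episodes
    p95 = float(np.percentile(shots, 95))         # 95th percentile (linear interpolation)
    return cvar, p95

# Hypothetical evaluation run: shot counts 1, 2, ..., 100.
cvar, p95 = tail_metrics(np.arange(1, 101))
print(cvar, p95)
```

Because CVaR averages over the whole tail rather than reading off a single quantile, it is at least as large as the corresponding 90th percentile of the sample and is more sensitive to rare long games.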

Theorems & Definitions (14)

  • Definition 1: Battleship POMDP
  • Lemma 1: Undiscounted step penalty and shots-to-win
  • Remark 1: Discounted PPO surrogate in the implementation
  • Definition 2: Adversarial latent-initial-state POMDP
  • Definition 3: Deterministic history-dependent attacker policy
  • Theorem 1: Latent minimax principle
  • Corollary 1: Extreme-point defenders
  • Definition 4: Defender $\varepsilon$-best response
  • Definition 5: Attacker $\varepsilon$-best response to a defender mixture
  • Theorem 2: Approximate best-response certificates
  • ...and 4 more