Adversarial Latent-State Training for Robust Policies in Partially Observable Domains

Angad Singh Ahuja

TL;DR

This work formalizes a focused setting where an adversary selects a hidden initial latent distribution before the episode, termed an adversarial latent-initial-state POMDP. It proves a latent minimax principle, characterizes worst-case defender distributions, and derives approximate best-response inequalities with finite-sample concentration bounds that make the optimization and sampling terms explicit.

Abstract

Robustness under latent distribution shift remains challenging in partially observable reinforcement learning. We formalize a focused setting where an adversary selects a hidden initial latent distribution before the episode, termed an adversarial latent-initial-state POMDP. Theoretically, we prove a latent minimax principle, characterize worst-case defender distributions, and derive approximate best-response inequalities with finite-sample concentration bounds that make the optimization and sampling terms explicit. Empirically, using a Battleship benchmark, we demonstrate that targeted exposure to shifted latent distributions reduces average robustness gaps between Spread and Uniform distributions from 10.3 to 3.1 shots at equal budget. Furthermore, iterative best-response training exhibits budget-sensitive behavior that is qualitatively consistent with the theorem-guided diagnostics once one accounts for discounted PPO surrogates and finite-sample noise. Ultimately, we show that for latent-initial-state problems, the framework yields a clean evaluation game and useful theorem-motivated diagnostics while also making clear where implementation-level surrogates and optimization limits enter.
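One reason worst-case defender distributions can be characterized at all is a standard fact: for a fixed attacker policy, expected shots-to-win is linear in the defender's latent distribution, so its maximum over the probability simplex is attained at a point mass. The sketch below illustrates that fact only. The per-layout costs are hypothetical numbers, the function names are invented for illustration, and an unconstrained simplex is assumed; it is not the paper's construction.

```python
import random

def expected_shots(mixture, shots_per_layout):
    """Expected shots-to-win of a fixed policy under a latent mixture.

    The value is a weighted sum, hence linear in the mixture weights.
    """
    return sum(p * c for p, c in zip(mixture, shots_per_layout))

def worst_point_mass(shots_per_layout):
    """Worst case over point masses: the single costliest hidden layout."""
    return max(shots_per_layout)

def random_mixture(rng, n):
    """Draw a random probability vector of length n (normalized positive weights)."""
    weights = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical expected shots of one fixed policy against four hidden layouts.
shots_per_layout = [41.0, 47.5, 52.0, 44.2]

rng = random.Random(0)
# Search over many random mixtures; none should beat the worst point mass.
best_mixture_value = max(
    expected_shots(random_mixture(rng, len(shots_per_layout)), shots_per_layout)
    for _ in range(1000)
)
```

Under these assumptions, no sampled mixture exceeds the cost of the worst single layout, which is why a search over defender distributions can be restricted to extreme points when the attacker policy is held fixed.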

Paper Structure

This paper contains 19 sections, 7 theorems, 102 equations, 1 figure, 6 tables, 2 algorithms.

Motivation and Claim
Related Work
Methodology, Theoretical Development, and Diagnostics
Results
Discussion and Future Work
Limitations
Conclusion
Appendixes
Notation Table
Proofs
Proof of the latent minimax principle
Proof of the extreme-point corollary
Proof of the approximate best-response certificates
Proof of finite-sample sign certification
Proof of marginal insufficiency for fixed policies
...and 4 more sections

Key Result

Lemma 1

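Suppose the reward is $r_t = -1$ for every shot taken until the game is won. Then for any policy $\pi$, $\mathbb{E}_\pi\left[\sum_{t=1}^{T} r_t\right] = -\mathbb{E}_\pi[T]$, where $T$ denotes the number of shots to win. Therefore maximizing undiscounted expected return is exactly equivalent to minimizing expected shots-to-win.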
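As a quick sanity check of the lemma, the sketch below assumes the reward is a flat penalty of -1 per shot until the final hit (an assumption suggested by the lemma's title, "Undiscounted step penalty and shots-to-win"). The one-dimensional toy board, the random non-repeating shooter, and all function names are hypothetical stand-ins, not the paper's Battleship environment.

```python
import random

def play_episode(rng, n_cells=10, n_ship_cells=3):
    """Play one toy episode; return (undiscounted_return, shots_to_win)."""
    # Hidden latent state: which cells are occupied by the ship.
    ship = set(rng.sample(range(n_cells), n_ship_cells))
    # Toy policy: fire at every cell once, in random order.
    order = list(range(n_cells))
    rng.shuffle(order)
    hits, shots, ret = 0, 0, 0.0
    for cell in order:
        shots += 1
        ret += -1.0  # assumed step penalty: -1 per shot
        if cell in ship:
            hits += 1
        if hits == n_ship_cells:
            break  # game won, episode ends
    return ret, shots

def check_lemma(n_episodes=2000, seed=0):
    """Return (mean undiscounted return, mean shots-to-win) over simulated episodes."""
    rng = random.Random(seed)
    episodes = [play_episode(rng) for _ in range(n_episodes)]
    mean_return = sum(r for r, _ in episodes) / n_episodes
    mean_shots = sum(s for _, s in episodes) / n_episodes
    return mean_return, mean_shots
```

Because each episode's return is the negative of its shot count, the two averages returned by `check_lemma` are negatives of each other for any policy plugged into the loop. With a discount factor below one this identity no longer holds exactly, which is the gap that the remark on the discounted PPO surrogate refers to.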

Figures (1)

Figure 1: Training dynamics and IBR diagnostics averaged over independent seeds. (a) Fixed-mixture and alternating stress approaches readily acquire nominal proficiency without catastrophic gameplay degradation. (b) Direct targeted adversarial exposure effectively limits extreme worst-case trajectory vulnerabilities, explicitly lowering tail severity ($\mathrm{CVaR}_{0.10}$ and $p95$). (c) In Stage 2, tracked IBR metrics show that sufficient optimization bandwidth can yield positive defender_adversarial shifts and subsequent attacker_adaptation; these curves should be read as theorem-motivated diagnostics rather than as exact row-wise certificates.
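The tail metrics named in panel (b) can be computed from per-episode shot counts along the lines of the following minimal sketch. It assumes that higher shot counts are worse, that $\mathrm{CVaR}_{0.10}$ means the average of the worst 10% of episodes, and that $p95$ is a nearest-rank 95th percentile. The paper's exact estimators may differ, and the function names are hypothetical.

```python
import math

def cvar_upper_tail(shots, alpha=0.10):
    """Mean of the worst (largest) ceil(alpha * n) shot counts."""
    if not shots:
        raise ValueError("need at least one episode")
    k = max(1, math.ceil(alpha * len(shots)))
    worst = sorted(shots, reverse=True)[:k]
    return sum(worst) / k

def percentile_nearest_rank(shots, q=95):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    if not shots:
        raise ValueError("need at least one episode")
    ordered = sorted(shots)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]
```

Reporting both numbers is a common design choice: the percentile marks where the tail begins, while the CVaR summarizes how severe outcomes are inside it.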

Theorems & Definitions (14)

Definition 1: Battleship POMDP
Lemma 1: Undiscounted step penalty and shots-to-win
Remark 1: Discounted PPO surrogate in the implementation
Definition 2: Adversarial latent-initial-state POMDP
Definition 3: Deterministic history-dependent attacker policy
Theorem 1: Latent minimax principle
Corollary 1: Extreme-point defenders
Definition 4: Defender $\varepsilon$-best response
Definition 5: Attacker $\varepsilon$-best response to a defender mixture
Theorem 2: Approximate best-response certificates
...and 4 more
