Conjectural Online Learning with First-order Beliefs in Asymmetric Information Stochastic Games

Tao Li, Kim Hammar, Rolf Stadler, Quanyan Zhu

TL;DR

This work addresses learning in asymmetric-information stochastic games (AISGs) by replacing intractable belief hierarchies with first-order beliefs within a conjectural online learning (COL) framework. COL employs a forecaster-actor-critic (FAC) architecture: a Bayesian forecaster generates a conjecture $\widehat{\pi}_{-k,t}$ about the opponent's strategy over a lookahead horizon $\ell_k$, a critic evaluates the value $\widehat{J}^{(\pi_k,\widehat{\pi}_{-k})}$, and an actor updates the policy via an $\ell_k$-step rollout. Conjectures are continuously refined using information feedback $\mathbf{i}^k_t$. A KL-based metric $K(\widehat{\ell}_{-k}, \bm{\nu})$ measures how consistent a conjecture is with that feedback, and the posterior $\mu_t^k$ concentrates on the set of consistent conjectures $\Theta_k^*(\bm{\nu})$, yielding empirical convergence to a Berk-Nash equilibrium in repeated AISGs. Theoretical results show asymptotic consistency of conjectures and convergence of the induced strategy profile to a Berk-Nash equilibrium, while the intrusion-response case study demonstrates faster and more stable adaptation to nonstationary attacks compared with reinforcement learning baselines. Overall, COL provides a practical online learning approach for resilient decision-making in socio-technical systems with asymmetric information, with potential applications in cyber-defense and IT infrastructure management.
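To make the forecaster's role concrete, the following is a minimal sketch of Bayesian learning over a finite set of conjectures, not the paper's algorithm. It assumes a deliberately simplified setting that is not in the original: the feedback is a binary signal (attack or no attack), each conjecture is a candidate attack probability, and the true attack probability (0.6) lies outside the candidate set, so the model is misspecified. The names `CANDIDATES`, `kl_bernoulli`, `consistent_conjectures`, `posterior_update`, and `run_forecaster` are hypothetical. Under these assumptions, the posterior should concentrate on the candidate with the smallest KL divergence from the true feedback distribution, which is the behavior the summary attributes to $\mu_t^k$ and $\Theta_k^*(\bm{\nu})$.

```python
import math

# Hypothetical finite set of conjectures: each value is a candidate probability
# that the opponent attacks in a given step (an assumption for illustration).
CANDIDATES = [0.2, 0.5, 0.8]


def kl_bernoulli(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def consistent_conjectures(true_p, candidates, tol=1e-12):
    """Candidates that minimize the KL divergence from the true feedback law."""
    scores = [kl_bernoulli(true_p, th) for th in candidates]
    best = min(scores)
    return [th for th, s in zip(candidates, scores) if s - best <= tol]


def posterior_update(log_post, candidates, obs):
    """One Bayes step in log space: add the log-likelihood, then normalize."""
    updated = [
        lp + math.log(th if obs == 1 else 1.0 - th)
        for lp, th in zip(log_post, candidates)
    ]
    m = max(updated)
    log_z = m + math.log(sum(math.exp(u - m) for u in updated))
    return [u - log_z for u in updated]


def run_forecaster(observations, candidates):
    """Start from a uniform prior and apply Bayes' rule once per observation."""
    log_post = [-math.log(len(candidates))] * len(candidates)
    for obs in observations:
        log_post = posterior_update(log_post, candidates, obs)
    return [math.exp(lp) for lp in log_post]


if __name__ == "__main__":
    # Deterministic feedback stream with attack frequency 0.6 (illustrative).
    feedback = [1, 1, 1, 0, 0] * 200
    print("KL-minimizing conjectures:", consistent_conjectures(0.6, CANDIDATES))
    print("posterior:", run_forecaster(feedback, CANDIDATES))
```

The update is carried out in log space so that the posterior weights of poorly fitting conjectures do not underflow over long feedback streams.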

Abstract

Asymmetric information stochastic games (AISGs) arise in many complex socio-technical systems, such as cyber-physical systems and IT infrastructures. Existing computational methods for AISGs are primarily offline and cannot adapt to equilibrium deviations. Further, current methods are limited to particular information structures to avoid belief hierarchies. Considering these limitations, we propose conjectural online learning (COL), an online learning method under generic information structures in AISGs. COL uses a forecaster-actor-critic (FAC) architecture, where subjective forecasts are used to conjecture the opponents' strategies within a lookahead horizon, and Bayesian learning is used to calibrate the conjectures. To adapt strategies to nonstationary environments based on information feedback, COL uses online rollout with cost function approximation (actor-critic). We prove that the conjectures produced by COL are asymptotically consistent with the information feedback in the sense of a relaxed Bayesian consistency. We also prove that the empirical strategy profile induced by COL converges to the Berk-Nash equilibrium, a solution concept characterizing rationality under subjectivity. Experimental results from an intrusion response use case demonstrate COL's faster convergence over state-of-the-art reinforcement learning methods against nonstationary attacks.
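The rollout step mentioned above can be illustrated with a small sketch, given here under strong simplifying assumptions that are not in the original: a fully observed two-state system (healthy or compromised), two defender actions (wait or recover), a fixed conjectured attack probability `p_hat`, a made-up cost function, and a base policy that always waits. All names (`cost`, `prob_compromised`, `base_cost_to_go`, `lookahead_q`, `rollout_action`) are hypothetical. The base policy's cost-to-go plays the role of the critic's cost approximation, and `depth` stands in for the lookahead horizon; the paper's setting, with beliefs and information feedback, is considerably richer.

```python
STATES = (0, 1)    # 0 = healthy, 1 = compromised (hypothetical model)
ACTIONS = (0, 1)   # 0 = wait, 1 = recover (hypothetical model)
BASE_ACTION = 0    # base policy: always wait


def cost(s, a):
    """Illustrative stage cost: 2 per compromised step, 1 per recovery."""
    return 2.0 * s + 1.0 * a


def prob_compromised(s, a, p_hat):
    """Probability that the next state is compromised under conjecture p_hat."""
    if a == 1 or s == 0:
        return p_hat   # healthy (possibly after recovery): attacked w.p. p_hat
    return 1.0         # compromised and waiting: stays compromised


def expected(J, s, a, p_hat):
    """Expected next-step value of J after taking action a in state s."""
    p1 = prob_compromised(s, a, p_hat)
    return p1 * J[1] + (1.0 - p1) * J[0]


def base_cost_to_go(p_hat, gamma=0.9, sweeps=200):
    """Discounted cost of the base policy, by repeated policy-evaluation sweeps."""
    J = [0.0, 0.0]
    for _ in range(sweeps):
        J = [cost(s, BASE_ACTION) + gamma * expected(J, s, BASE_ACTION, p_hat)
             for s in STATES]
    return J


def lookahead_q(s, a, depth, J_base, p_hat, gamma=0.9):
    """Cost of action a followed by (depth - 1) greedy steps, then the base policy."""
    if depth == 1:
        tail = J_base
    else:
        tail = [min(lookahead_q(s2, b, depth - 1, J_base, p_hat, gamma)
                    for b in ACTIONS)
                for s2 in STATES]
    return cost(s, a) + gamma * expected(tail, s, a, p_hat)


def rollout_action(s, depth, p_hat, gamma=0.9):
    """Action minimizing the depth-step lookahead cost against the conjecture."""
    J_base = base_cost_to_go(p_hat, gamma)
    return min(ACTIONS,
               key=lambda a: lookahead_q(s, a, depth, J_base, p_hat, gamma))


if __name__ == "__main__":
    for s in STATES:
        print("state", s, "-> rollout action", rollout_action(s, 2, 0.1))
```

In this toy model the rollout policy recovers when compromised and waits when healthy, and its one-step lookahead cost is never worse than the base policy's cost-to-go, which is the usual policy-improvement property of rollout.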

Paper Structure (12 sections, 3 theorems, 29 equations, 2 figures, 2 tables, 1 algorithm)

Key Result

Theorem 1

For any sequence $(\bm{\pi}_{\mathbf{h}_t}, \bm{\nu}_{\mathbf{h}_t})_{t\geq 1}$ from Alg. 1, a.s.-$\mathbb{P}^{\mathscr{B},\mathscr{R}}$, where $\mathbb{P}^{\mathscr{B},\mathscr{R}}$ denotes the probability measure over the set of realizable histories $\mathbf{h}_t$ induced by $(\bm{\pi}_{\mathbf{h}_t})_{t\geq 1}$ under the rollout ($\mathscr{R}$) and Bayesian belief update ($\ma

Figures (2)

  • Figure 1: One-step cycle in COL (conjectural online learning; see also Alg. 1); the player $\mathrm{k}$ updates its conjecture $\widehat{\ell}_{-\mathrm{k},t}$ about the opponent's policy parameterization by sampling from the posterior $\mu_{t}^{\mathrm{k}}$, from which it forecasts the opponent's future moves $\widehat{\pi}_{-\mathrm{k},t}$ conditional on its own first-order beliefs $\mathbf{b}_t^{\mathrm{k}}$; a rollout-based actor-critic then improves the policy against the conjectured opponent.
  • Figure 2: Evaluation results for the intrusion response case study; values indicate the mean; the shaded areas and the error bars indicate the 95% confidence interval based on $20$ random seeds; hyperparameters are listed in the online appendix.

Theorems & Definitions (9)

  • Theorem 1
  • proof
  • Definition 1: Berk-Nash Equilibrium, adapted from Esponda and Pouzo (2016)
  • Corollary 1
  • proof
  • Lemma 1
  • proof
  • proof
  • proof