
SkyNet: Belief-Aware Planning for Partially-Observable Stochastic Games

Adam Haile

Abstract

In 2019, Google DeepMind released MuZero, a model-based reinforcement learning method that achieves strong results in perfect-information games by combining learned dynamics models with Monte Carlo Tree Search (MCTS). However, comparatively little work has extended MuZero to partially observable, stochastic, multi-player environments, where agents must act under uncertainty about hidden state. Such settings arise not only in card games but in domains such as autonomous negotiation, financial trading, and multi-agent robotics. In the absence of explicit belief modeling, MuZero's latent encoding has no dedicated mechanism for representing uncertainty over unobserved variables. To address this, we introduce SkyNet (Belief-Aware MuZero), which adds ego-conditioned auxiliary heads for winner prediction and rank estimation to the standard MuZero architecture. These objectives encourage the latent state to retain information predictive of outcomes under partial observability, without requiring explicit belief-state tracking or changes to the search algorithm. We evaluate SkyNet on Skyjo, a partially observable, non-zero-sum, stochastic card game, using a decision-granularity environment, transformer-based encoding, and a curriculum of heuristic opponents with self-play. In 1000-game head-to-head evaluations at matched checkpoints, SkyNet achieves a 75.3% peak win rate against the baseline (+194 Elo, $p < 10^{-50}$). SkyNet also outperforms the baseline against heuristic opponents (0.720 vs. 0.466 win rate). Critically, the belief-aware model initially underperforms the baseline but decisively surpasses it once training throughput is sufficient, suggesting that belief-aware auxiliary supervision improves learned representations under partial observability, but only given adequate data flow.



Figures (8)

  • Figure 1: A player's $3 \times 4$ card grid during a game of Skyjo, with the deck shown at right. Cards range from $-2$ to $12$, with lower total scores being more desirable. Image credit: 3rd Grade Thoughts (https://www.3rdgradethoughts.com/2019/01/board-game-review-skyjo.html).
  • Figure 2: Architecture comparison between baseline MuZero (left) and Belief-Aware MuZero / SkyNet (right). Both share the same representation, dynamics, and prediction networks. SkyNet adds an ego conditioning layer that injects player identity before prediction, and two auxiliary heads (winner and rank) that shape the latent representation via outcome-prediction objectives. Pink-shaded outputs indicate the additional belief heads.
  • Figure 3: Training loss curves for baseline MuZero and Belief-Aware MuZero. Raw values shown in light color; 20-iteration rolling averages in bold. Both models converge, with the belief-aware model's total loss higher due to the additional auxiliary loss terms.
  • Figure 4: Convergence of the belief-aware model's auxiliary prediction heads. The winner head loss (left) and rank head loss (right) both decrease substantially over training, confirming that the network learns to predict game outcomes with increasing accuracy.
  • Figure 5: Evaluation win rate against the heuristic bot curriculum over training. The belief-aware model consistently achieves higher win rates after an initial ramp-up period. The dashed line indicates the 50% random baseline.
  • ...and 3 more figures
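
The ego conditioning and auxiliary heads described in the abstract and Figure 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the latent size, player count, one-hot conditioning scheme, and linear heads are all assumptions made here for concreteness (the real model presumably uses learned network layers).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- the paper does not specify them in this excerpt.
LATENT_DIM = 64
NUM_PLAYERS = 4

def ego_condition(latent, ego_id, num_players=NUM_PLAYERS):
    """One simple way to 'inject player identity before prediction'
    (Figure 2): append a one-hot ego indicator to the latent state."""
    one_hot = np.zeros(num_players)
    one_hot[ego_id] = 1.0
    return np.concatenate([latent, one_hot])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Illustrative random linear heads; in the real model these are learned.
W_winner = rng.normal(size=(LATENT_DIM + NUM_PLAYERS, NUM_PLAYERS))
W_rank = rng.normal(size=(LATENT_DIM + NUM_PLAYERS, NUM_PLAYERS))

latent = rng.normal(size=LATENT_DIM)   # output of the representation network
h = ego_condition(latent, ego_id=2)    # condition on "I am player 2"

winner_probs = softmax(h @ W_winner)   # winner head: P(player i wins)
rank_probs = softmax(h @ W_rank)       # rank head: P(ego finishes at rank r)
```

Both heads read the same ego-conditioned latent, so gradients from their outcome-prediction losses would flow back into the shared representation, which is the mechanism the abstract credits for shaping the latent state under partial observability.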