Mixture of Masters: Sparse Chess Language Models with Player Routing

Giacomo Frisoni, Lorenzo Molfetta, Davide Freddi, Gianluca Moro

TL;DR

This work tackles the problem of stylistic homogenization in chess language models by introducing Mixture-of-Masters (MoM), a sparse mixture-of-experts that deploys multiple Grandmaster-inspired decoders. Each expert is trained with SSL to imitate a specific GM and then refined with GRPO-based RL to enforce legality while preserving the GM’s style; a gating network routes moves to the top-k experts, creating dynamic, interpretable persona switching. The authors validate MoM on unseen games against Stockfish, showing superior performance to dense baselines and model soups, while demonstrating that expert specialization yields robust stylistic signatures via a post-hoc behavioral stylometry framework that uses vision-based embeddings and contrastive learning. The approach offers a scalable, modular path to diverse, controllable chess AI with educational and analytical value, and highlights the trade-offs between legality, stylistic fidelity, and outcome-focused strength in constrained generation settings.

Abstract

Modern chess language models are dense transformers trained on millions of games played by thousands of high-rated individuals. However, these monolithic networks tend to collapse into mode-averaged behavior, where stylistic boundaries are blurred, and rare but effective strategies are suppressed. To counteract homogenization, we introduce Mixture-of-Masters (MoM), the first chess mixture-of-experts model with small-sized GPT experts emulating world-class grandmasters. Each expert is trained with a combination of self-supervised learning and reinforcement learning guided by chess-specific rewards. For each move, a post-hoc learnable gating network selects the most appropriate persona to channel depending on the game state, allowing MoM to switch its style dynamically (e.g., Tal's offensive vocation or Petrosian's defensive solidity). When evaluated against Stockfish on unseen standard games, MoM outperforms both dense individual expert networks and popular GPT baselines trained on aggregated data, while ensuring generation variety, control, and interpretability.
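
The per-move routing described above can be illustrated with a generic top-k gating sketch. This is not the authors' implementation: the function names (`top_k_route`, `mix_expert_outputs`), the plain-list representation, and the choice to renormalize the softmax over only the selected experts are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are a softmax restricted to the selected experts,
    so they sum to 1. Other normalizations are possible.
    """
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def mix_expert_outputs(expert_outputs: Sequence[Sequence[float]],
                       routing: Sequence[Tuple[int, float]]) -> List[float]:
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for idx, weight in routing:
        for j in range(dim):
            mixed[j] += weight * expert_outputs[idx][j]
    return mixed


# Hypothetical gate scores for three experts at one game state.
routing = top_k_route([2.0, 0.5, 1.0], k=2)
# Hypothetical per-expert output vectors (e.g., scores over two candidate moves).
mixed = mix_expert_outputs([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], routing)
```

With k = 1 the sketch reduces to picking a single persona per move; larger k blends several experts, which is the knob the expert-count ablation varies.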

Paper Structure (62 sections, 7 equations, 19 figures, 8 tables)


Figures (19)

  • Figure 1: Illustration of MoM. First, multiple decoder-only chess language models are trained to emulate the game decisions of specific grandmasters. Then, their layers are combined into a sparse language model by alternating uniform weight merging and top-$k$ routing for next move prediction.
  • Figure 2: Overview of the visual chess player identification system. Left: During training, game embeddings are processed through contrastive learning against GM-specific centroids to enforce intra-player similarity and inter-player distinctiveness. Right: The visual encoding pipeline processes consecutive chess board frames to extract and temporally aggregate spatial patch tokens (in blue), with positional and temporal encodings generating the final game embedding.
  • Figure 3: Ablation studies. (a) Effect of seed model on expert FIDEScore; SSL-only, Stockfish 1, pooled over 10 runs. (b) Effect of expert count $k$ on game results; MoM (top-5 exp. by FIDEScore), Stockfish 0, pooled over 10 runs. (c) Effect of RL on legality; Karvonen seed, Stockfish 1, pooled over 10 runs.
  • Figure 3: Grandmaster dataset statistics. Aggregated view (train, test). Played games span from 1984 to 2025.
  • Figure 4: Style Consistency (left): Relative change in cosine distance when computing expert-specific centroids from random subsamples of played games; Style Acquisition (right): Recall of style-similarity retrieval mapping of played games to the correct real-GM centroid.
  • ...and 14 more figures
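
The centroid-based retrieval mentioned in the Figure 2 and Figure 4 captions can be sketched as a nearest-centroid lookup under cosine similarity. This is a minimal illustration, not the paper's stylometry pipeline: the helper names, the toy two-dimensional embeddings, and the use of top-1 recall are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def retrieval_recall(games: Sequence[Tuple[str, Sequence[float]]],
                     centroids: Dict[str, Sequence[float]]) -> float:
    """Fraction of games whose most similar centroid matches their label.

    Assumption: top-1 retrieval; the paper may report recall at other cutoffs.
    """
    hits = 0
    for label, embedding in games:
        predicted = max(
            centroids,
            key=lambda name: cosine_similarity(embedding, centroids[name]),
        )
        if predicted == label:
            hits += 1
    return hits / len(games)


# Toy, hypothetical embeddings for two players' reference games.
centroids = {
    "Tal": centroid([[1.0, 0.0], [0.8, 0.2]]),
    "Petrosian": centroid([[0.0, 1.0], [0.2, 0.8]]),
}
# Games played by the experts, labeled with the GM each expert emulates.
games = [
    ("Tal", [0.9, 0.1]),
    ("Petrosian", [0.2, 0.8]),
    ("Tal", [0.1, 0.9]),  # retrieved as Petrosian, so it counts as a miss
]
recall = retrieval_recall(games, centroids)
```

In this toy setup two of the three games map to the correct centroid. Style consistency could be probed the same way by recomputing centroids from random subsamples and measuring how far they move.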