Mixture of Masters: Sparse Chess Language Models with Player Routing
Giacomo Frisoni, Lorenzo Molfetta, Davide Freddi, Gianluca Moro
TL;DR
This work tackles stylistic homogenization in chess language models by introducing Mixture-of-Masters (MoM), a sparse mixture-of-experts model built from multiple grandmaster-inspired decoders. Each expert is first trained with self-supervised learning (SSL) to imitate a specific grandmaster (GM) and then refined with GRPO-based reinforcement learning (RL) to enforce move legality while preserving that GM's style. A gating network routes each move to the top-k experts, which yields dynamic, interpretable persona switching. The authors evaluate MoM on unseen games against Stockfish, where it outperforms dense baselines and model soups. Using a post-hoc behavioral stylometry framework based on vision-based embeddings and contrastive learning, they also show that expert specialization yields robust stylistic signatures. The approach offers a scalable, modular path to diverse, controllable chess AI with educational and analytical value, and it highlights the trade-offs between legality, stylistic fidelity, and outcome-focused strength in constrained generation settings.
Abstract
Modern chess language models are dense transformers trained on millions of games played by thousands of high-rated individuals. However, these monolithic networks tend to collapse into mode-averaged behavior, where stylistic boundaries are blurred, and rare but effective strategies are suppressed. To counteract homogenization, we introduce Mixture-of-Masters (MoM), the first chess mixture-of-experts model with small-sized GPT experts emulating world-class grandmasters. Each expert is trained with a combination of self-supervised learning and reinforcement learning guided by chess-specific rewards. For each move, a post-hoc learnable gating network selects the most appropriate persona to channel depending on the game state, allowing MoM to switch its style dynamically (e.g., Tal's offensive vocation or Petrosian's defensive solidity). When evaluated against Stockfish on unseen standard games, MoM outperforms both dense individual expert networks and popular GPT baselines trained on aggregated data, while ensuring generation variety, control, and interpretability.
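
To make the per-move routing idea concrete, here is a minimal sketch of top-k gating over a set of expert next-move distributions. It is illustrative only: the summary above does not specify the gate's architecture, its inputs, or how the selected experts' outputs are combined, so the function names (`route_top_k`, `mix_move_distributions`), the softmax-then-renormalize weighting, and the weighted mixing step are all assumptions rather than the authors' implementation. The gate scores and move probabilities are made-up numbers.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Keep the k highest-scoring experts and renormalize their weights.

    Assumption: the gate emits one raw score per expert for the current
    game state; how those scores are produced is not modeled here.
    """
    names = list(gate_scores)
    probs = softmax([gate_scores[n] for n in names])
    ranked = sorted(zip(names, probs), key=lambda p: p[1], reverse=True)[:k]
    norm = sum(w for _, w in ranked)
    return [(name, w / norm) for name, w in ranked]


def mix_move_distributions(
    routed: List[Tuple[str, float]],
    expert_move_probs: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Weighted mixture of the selected experts' next-move distributions.

    Assumption: outputs are mixed; with k=1 this reduces to picking
    a single persona for the move.
    """
    mixed: Dict[str, float] = {}
    for name, weight in routed:
        for move, p in expert_move_probs[name].items():
            mixed[move] = mixed.get(move, 0.0) + weight * p
    return mixed


if __name__ == "__main__":
    # Hypothetical gate scores and per-expert move probabilities.
    gate = {"tal": 2.0, "petrosian": 0.5, "capablanca": 1.0}
    moves = {
        "tal": {"Nxf7": 0.7, "d4": 0.3},
        "petrosian": {"Kh1": 0.8, "d4": 0.2},
        "capablanca": {"d4": 0.6, "Nxf7": 0.4},
    }
    routed = route_top_k(gate, k=2)
    print(routed)
    print(mix_move_distributions(routed, moves))
```

Because the routing weights are explicit, the selected persona (or blend of personas) can be logged for every move, which is one simple way such a design could support the interpretability the abstract mentions.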
