Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models
Jared Junkin, Samuel Nathanson
TL;DR
This paper examines whether causal masking can be applied effectively to spatial data by studying chess, a domain with both a spatial (FEN) and a sequential (PGN) representation. The authors train a 1.3B-parameter Llama model on FEN with causal masking and compare it against PGN-trained and bidirectional FEN models, showing that spatial data with causal masking yields comparable or superior performance, with an estimated Elo of around 2630 in calibrated tests. The key contributions are a systematic comparison of encoding and masking strategies, an open-source fine-tuned LLM baseline, and the finding that tokenizer alignment and prompt design are crucial in structured symbolic domains. The results suggest broader implications for learning spatial datasets with unimodal LLMs and motivate further study of causal masking in spatially structured tasks beyond chess.
Abstract
Language models are traditionally designed around causal masking. In domains with spatial or relational structure, causal masking is often viewed as inappropriate, and sequential linearizations are used instead. Yet whether it is viable to accept the information loss that causal masking introduces on nonsequential data has received little direct study, in part because few domains offer both spatial and sequential representations of the same dataset. In this work, we investigate this question in chess, which naturally supports both representations. We train language models with bidirectional and causal self-attention mechanisms on both spatial (board-based) and sequential (move-based) data. Our results show that models trained on spatial board states, even with causal masking, consistently achieve stronger playing strength than models trained on sequential data. Although our experiments are conducted on chess, the results are methodological and may have broader implications: applying causal masking to spatial data is a viable way to train unimodal LLMs, and in some domains it is even preferable to sequentialization.
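To make the contrast between causal and bidirectional self-attention on a spatial (FEN) input concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical character-level tokenization of the standard starting-position FEN; the helper names `char_tokens`, `attention_mask`, and `visible_fraction` are illustrative only, and the paper's actual tokenizer may differ.

```python
# Standard chess starting position in FEN (piece placement, side to move,
# castling rights, en passant square, halfmove clock, fullmove number).
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"


def char_tokens(fen):
    """Hypothetical tokenization: one token per character (an assumption)."""
    return list(fen)


def attention_mask(n, causal):
    """Return an n x n mask where mask[i][j] is True if position i may
    attend to position j. Causal masking allows only j <= i."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]


def visible_fraction(mask):
    """Fraction of (query, key) pairs that the mask leaves visible."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


tokens = char_tokens(START_FEN)
n = len(tokens)
causal = attention_mask(n, causal=True)
bidirectional = attention_mask(n, causal=False)

# Under causal masking the first board token sees only itself, while the
# final token can attend to the entire board description.
print("tokens:", n)
print("first token sees:", sum(causal[0]), "position(s)")
print("last token sees:", sum(causal[-1]), "position(s)")
print("causal visible fraction:", visible_fraction(causal))
print("bidirectional visible fraction:", visible_fraction(bidirectional))
```

In this sketch the causal mask hides roughly half of all token pairs, which is one simple way to picture the information loss discussed above: tokens describing early ranks cannot attend to later ranks, yet the final position, where a prediction would typically be read out, still has access to the whole board.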
