Incremental Learning of Sparse Attention Patterns in Transformers
Oğuz Kaan Yüksel, Rodrigo Alvarez Lucendo, Nicolas Flammarion
TL;DR
This work investigates how transformers learn to compose information across multiple past positions using sparse attention, framing the problem as a high-order Markov chain task. It shows that learning unfolds in stages: heads first compete to capture the most statistically salient pattern and later cooperate by specializing on additional patterns. These dynamics are captured by simplified gradient-flow equations linked to tensor factorization. The authors provide convergence results for the competitive and cooperative phases and show that early stopping acts as a beneficial regularizer by biasing the model toward simpler, misspecified hypothesis classes, with implications for generalization in language and reasoning tasks. Through an analysis of a regression variant of the task in a minimal architectural setting, the paper elucidates the mechanisms by which sparse attention patterns emerge and coordinate to solve complex sequential tasks. Overall, the results offer a theoretical foundation for staged learning in transformers and its impact on generalization and sample efficiency in data-constrained regimes.
Abstract
This paper introduces a high-order Markov chain task to investigate how transformers learn to integrate information from multiple past positions with varying statistical significance. We demonstrate that transformers learn this task incrementally: each stage is defined by the acquisition of specific information through sparse attention patterns. Notably, we identify a shift in learning dynamics from competitive, where heads converge on the most statistically dominant pattern, to cooperative, where heads specialize in distinct patterns. We model these dynamics using simplified differential equations that characterize the trajectory and prove stage-wise convergence results. Our analysis reveals that transformers ascend a complexity ladder by passing through simpler, misspecified hypothesis classes before reaching the full model class. We further show that early stopping acts as an implicit regularizer, biasing the model toward these simpler classes. These results provide a theoretical foundation for the emergence of staged learning and complex behaviors in transformers, offering insights into generalization for natural language processing and algorithmic reasoning.
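
To make the task description concrete, the following is a minimal sketch of one possible high-order Markov chain in which the next token depends on a sparse set of past positions with unequal statistical weight. It is an assumed, illustrative instantiation (a mixture over lags), not the paper's exact construction: the names `lags`, `weights`, `transitions`, `make_transition`, and `sample_sequence`, as well as all numeric values, are hypothetical.

```python
import random

def make_transition(vocab_size, rng):
    """Build a random row-stochastic matrix; row i is a distribution over next tokens."""
    rows = []
    for _ in range(vocab_size):
        raw = [rng.random() for _ in range(vocab_size)]
        total = sum(raw)
        rows.append([x / total for x in raw])
    return rows

def sample_sequence(length, vocab_size, lags, weights, transitions, rng):
    """Sample from a mixture-over-lags Markov chain (assumed toy task).

    At each step, lag index k is drawn with probability weights[k]; the next
    token is then drawn from transitions[k], conditioned on the token that
    sits lags[k] positions back. Larger weights[k] makes that past position
    more statistically salient.
    """
    max_lag = max(lags)
    # Seed the sequence with uniform tokens so every lag is defined.
    seq = [rng.randrange(vocab_size) for _ in range(max_lag)]
    while len(seq) < length:
        k = rng.choices(range(len(lags)), weights=weights)[0]
        prev = seq[-lags[k]]
        nxt = rng.choices(range(vocab_size), weights=transitions[k][prev])[0]
        seq.append(nxt)
    return seq

rng = random.Random(0)          # fixed seed for reproducibility
vocab_size = 4                  # hypothetical vocabulary size
lags = [1, 3, 5]                # sparse set of relevant past positions
weights = [0.6, 0.3, 0.1]       # decreasing statistical significance
transitions = [make_transition(vocab_size, rng) for _ in lags]
seq = sample_sequence(50, vocab_size, lags, weights, transitions, rng)
```

Under this assumed setup, a model that attends only to lag 1 corresponds to a simpler, misspecified hypothesis class, while a model that attends to all three lags matches the full data-generating process. This mirrors the complexity ladder described in the abstract, though the paper's own task definition should be consulted for the precise form.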
