Transformers versus the EM Algorithm in Multi-class Clustering
Yihan He, Hong-Yu Chen, Yuan Cao, Jianqing Fan, Han Liu
TL;DR
The paper addresses the problem of understanding how Transformer architectures can perform unsupervised multi-class clustering for Gaussian Mixture Models by linking Softmax attention to EM/Lloyd's algorithm. It develops a constructive approximation theory showing that pre-trained Transformers can emulate Lloyd iterations with explicit bounds and derives generalization guarantees for ERM pretraining. The key contributions include universal approximation results for multi-head Softmax mappings, generalization bounds, and minimax-rate results under sufficient pretraining and initialization, corroborated by simulations. The work advances the theoretical foundation of Transformer-based in-context learning for unsupervised algorithmic tasks and suggests practical implications for leveraging LLMs in clustering and related inference problems.
Abstract
LLMs, which use the Transformer model as their backbone, demonstrate significant inference capacities in complicated machine learning tasks. Motivated by the limited understanding of such models on unsupervised learning problems, we study the learning guarantees of Transformers in performing multi-class clustering of Gaussian Mixture Models. We develop a theory drawing strong connections between Softmax Attention layers and the workflow of the EM algorithm for clustering mixtures of Gaussians. Our theory provides approximation bounds for the Expectation and Maximization steps by proving the universal approximation abilities of multivariate mappings by Softmax functions. In addition to the approximation guarantees, we also show that, with a sufficient number of pre-training samples and an initialization, Transformers can achieve the minimax optimal rate for the problem considered. Our extensive simulations empirically verify our theory by revealing the strong learning capacities of Transformers even beyond the assumptions in the theory, shedding light on the powerful inference capacities of LLMs.
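The connection between Softmax attention and the EM workflow can be illustrated with a minimal NumPy sketch. It is not the paper's construction: it assumes an isotropic mixture with equal mixing weights, and the names `attention_responsibilities`, `soft_em_step`, and the inverse-variance parameter `beta` are hypothetical, introduced only for this example. Under those assumptions the E-step responsibilities are a softmax over inner-product scores between points (queries) and centers (keys), and the M-step is the resulting attention-weighted average of the points.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_responsibilities(X, centers, beta=1.0):
    """E-step written as attention scores (illustrative assumption).

    For an isotropic mixture with equal weights, the posterior over
    clusters is softmax_k(-beta/2 * ||x - mu_k||^2). Expanding the
    square, the ||x||^2 term is constant in k and cancels, leaving
    an inner product x . mu_k plus a per-center bias.
    """
    bias = -0.5 * (centers ** 2).sum(axis=1)   # shape (K,)
    scores = beta * (X @ centers.T + bias)     # shape (N, K)
    return softmax(scores, axis=1)


def soft_em_step(X, centers, beta=1.0):
    """One EM iteration for the centers only.

    Returns the updated centers and the responsibilities computed
    from the centers that were passed in. As beta grows, the soft
    assignment approaches the hard assignment of Lloyd's algorithm.
    """
    W = attention_responsibilities(X, centers, beta)   # E-step
    new_centers = (W.T @ X) / W.sum(axis=0)[:, None]   # M-step
    return new_centers, W
```

Iterating `soft_em_step` from a reasonable starting point mimics the dependence on initialization mentioned in the abstract; with a poor starting point, this plain iteration can settle on a bad local solution.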
