Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining
Qi Zhang, Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang
TL;DR
This work presents the first theoretical comparison between autoregressive and masked generative SSL, framing both within a unified matrix-decomposition view of a co-occurrence matrix. It proves downstream guarantees for linear classification tied to the singular values of the co-occurrence matrix. It shows that masked SSL yields stronger cross-sample connectivity, while autoregressive SSL better supports generation because its flexible-length conditional texts match the flexible lengths of test samples. To combine both strengths, the authors propose diversity-enhanced autoregressive and variable-length masked objectives, and demonstrate substantial gains on language (GLUE) and vision (ImageNet) benchmarks as well as on generation metrics (perplexity, reconstruction loss). The findings offer principled guidance for designing SSL objectives that improve classification and generation together.
Abstract
In recent years, generative self-supervised learning (SSL) paradigms have exhibited impressive performance across visual, language, and multi-modal domains. While the varied designs of generative SSL objectives lead to distinct properties in downstream tasks, a theoretical understanding of these differences remains largely unexplored. In this paper, we establish the first theoretical comparisons between two leading generative SSL paradigms: autoregressive SSL and masked SSL. By establishing theoretical frameworks, we elucidate the strengths and limitations of autoregressive and masked SSL in the primary evaluation tasks of classification and content generation. Our findings demonstrate that in classification tasks, the flexibility of target tokens in masked SSL fosters more inter-sample connections than the fixed position of target tokens in autoregressive SSL, which yields superior clustering performance. In content generation tasks, the misalignment between the flexible lengths of test samples and the fixed length of unmasked texts in masked SSL (vs. flexible lengths of conditional texts in autoregressive SSL) hinders its generation performance. To combine the strengths of both paradigms and mitigate their weaknesses, we propose diversity-enhanced autoregressive and variable-length masked objectives, which substantially improve the classification performance of autoregressive SSL and the generation performance of masked SSL. Code is available at https://github.com/PKU-ML/LookAheadLookAround.
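The contrast between the two objectives can be made concrete with a toy sketch of how each one forms (context, target) training pairs. This is an illustration only, not the authors' implementation: the function names, the `[MASK]` placeholder, and the 25% mask ratio are assumptions chosen for the example.

```python
import random


def autoregressive_pairs(tokens):
    """Build next-token pairs: every prefix predicts the token right after it.

    The context length varies from 1 to len(tokens) - 1, but the target
    always sits at the single position immediately following the context.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_pairs(tokens, mask_ratio=0.25, mask_token="[MASK]", seed=0):
    """Build one masked example: hide a random subset and predict it.

    For a fixed sequence length and mask ratio, the number of visible
    (unmasked) tokens is fixed, while the target positions can fall
    anywhere in the sequence. mask_ratio and mask_token are illustrative
    assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    context = [mask_token if i in masked_positions else tok
               for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return context, targets
```

Calling `autoregressive_pairs` on a sentence yields contexts of every length with the target fixed at the next position, whereas `masked_pairs` keeps the number of visible tokens constant and lets the target positions vary with the seed. These are the two properties the abstract attributes to each paradigm.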
