Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining

Qi Zhang, Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang

TL;DR

This work delivers the first theoretical comparison between autoregressive and masked generative SSL, framing both within a unified matrix-decomposition perspective on a co-occurrence matrix. It proves downstream guarantees for linear classification tied to the singular values of the co-occurrence matrix and identifies that masked SSL yields stronger cross-sample connectivity, while autoregressive SSL better supports generation due to objective-length alignment. To leverage both strengths, the authors propose diversity-enhanced autoregressive and variable-length masked objectives, and demonstrate substantial gains on language (GLUE) and vision (ImageNet) benchmarks as well as generation metrics (perplexity, reconstruction loss). The findings offer principled guidance for designing SSL objectives that improve classification and generation in tandem, with practical improvements across domains and clear avenues for future refinement.

Abstract

In recent years, the rise of generative self-supervised learning (SSL) paradigms has exhibited impressive performance across visual, language, and multi-modal domains. While the varied designs of generative SSL objectives lead to distinct properties in downstream tasks, a theoretical understanding of these differences remains largely unexplored. In this paper, we establish the first theoretical comparisons between two leading generative SSL paradigms: autoregressive SSL and masked SSL. Through establishing theoretical frameworks, we elucidate the strengths and limitations of autoregressive and masked SSL within the primary evaluation tasks of classification and content generation. Our findings demonstrate that in classification tasks, the flexibility of targeted tokens in masked SSL fosters more inter-sample connections compared to the fixed position of target tokens in autoregressive SSL, which yields superior clustering performance. In content generation tasks, the misalignment between the flexible lengths of test samples and the fixed length of unmasked texts in masked SSL (vs. flexible lengths of conditional texts in autoregressive SSL) hinders its generation performance. To leverage each other's strengths and mitigate weaknesses, we propose diversity-enhanced autoregressive and variable-length masked objectives, which substantially improve the classification performance of autoregressive SSL and the generation performance of masked SSL. Code is available at https://github.com/PKU-ML/LookAheadLookAround.
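The contrast the abstract draws between fixed and flexible target positions can be made concrete with a toy construction of (context, target) pairs. The sketch below is illustrative only: the function names are hypothetical, and masking a single token at a time is a simplifying assumption rather than the masking scheme used in the paper.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[Tuple[str, ...], str]


def autoregressive_pairs(tokens: Sequence[str]) -> List[Pair]:
    """Each prefix predicts the token that follows it.

    The target always sits immediately after the context, and the
    context length varies from 1 to len(tokens) - 1.
    """
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


def masked_pairs(tokens: Sequence[str], mask_token: str = "[MASK]") -> List[Pair]:
    """Mask one position at a time (a simplification) and predict it.

    The target can sit at any position, while the context keeps the
    full, fixed sequence length.
    """
    pairs = []
    for i in range(len(tokens)):
        context = tuple(tokens[:i]) + (mask_token,) + tuple(tokens[i + 1:])
        pairs.append((context, tokens[i]))
    return pairs


if __name__ == "__main__":
    sentence = ["the", "cat", "sat"]
    for context, target in autoregressive_pairs(sentence):
        print("AR    ", context, "->", target)
    for context, target in masked_pairs(sentence):
        print("Masked", context, "->", target)
```

In this toy setting the autoregressive contexts have varying lengths with the target fixed at the end, whereas the masked contexts all share one length with the target free to move, which mirrors the two properties the abstract ties to generation and classification respectively.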

Paper Structure (26 sections, 7 theorems, 69 equations, 3 figures, 9 tables)

Key Result

Theorem 3.1

Let $\bar{A}$ be the normalized co-occurrence matrix, i.e., $\bar{A}_{X_i,X_i^+} = \frac{A_{X_i,X_i^+}}{\sqrt{P_C(X_i)P_G(X_i^+)}}$. Then we obtain [...], where the $X_i$-th row of $F$ and the $X_i^+$-th row of $W'$ represent the encoded features and the token embeddings, respectively, i.e., $F_{X_i} = \sqrt{P_C(X_i)}f(X_i)^\top$, $W'_{X_i^+} = \sqrt{P_G(X_i^+)}W_{X_i^+}$.
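The theorem's displayed equation does not appear above. Given the row definitions, the product $FW'^\top$ has entries $\sqrt{P_C(X_i)P_G(X_i^+)}\,f(X_i)^\top W_{X_i^+}$, so one reading consistent with the matrix-decomposition framing in the TL;DR is a low-rank approximation $\bar{A} \approx FW'^\top$. That reading is an assumption, not the theorem's statement. The sketch below builds $\bar{A}$ from a toy co-occurrence table and factors it with a truncated SVD, which by the Eckart-Young theorem gives the best rank-$k$ approximation in Frobenius norm. It further assumes that $P_C$ and $P_G$ are the row and column marginals of $A$; the function names are hypothetical.

```python
import numpy as np


def normalized_cooccurrence(A):
    """Return A_bar[i, j] = A[i, j] / sqrt(P_C(i) * P_G(j)).

    Assumption: A is rescaled to a joint distribution, and P_C, P_G are
    its row (context) and column (target) marginals. All marginals must
    be strictly positive.
    """
    A = np.asarray(A, dtype=float)
    A = A / A.sum()
    p_c = A.sum(axis=1)  # marginal over contexts
    p_g = A.sum(axis=0)  # marginal over targets
    return A / np.sqrt(np.outer(p_c, p_g))


def best_rank_k_factors(A_bar, k):
    """Rank-k factors F, W with F @ W.T closest to A_bar in Frobenius norm."""
    U, s, Vt = np.linalg.svd(A_bar, full_matrices=False)
    root = np.sqrt(s[:k])
    F = U[:, :k] * root   # plays the role of the feature matrix F
    W = Vt[:k].T * root   # plays the role of the embedding matrix W'
    return F, W, s


if __name__ == "__main__":
    # Toy table: two contexts, two targets, strong diagonal co-occurrence.
    A_bar = normalized_cooccurrence([[4, 1], [1, 4]])
    F, W, s = best_rank_k_factors(A_bar, k=1)
    print("singular values:", s)
    print("rank-1 residual:", np.linalg.norm(A_bar - F @ W.T))
```

The residual of the rank-$k$ fit equals the root sum of squares of the discarded singular values, which gives some intuition for why guarantees of this kind are stated in terms of the singular values of the co-occurrence matrix.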

Figures (3)

  • Figure 1: Illustration of two primary paradigms in generative SSL: autoregressive SSL and masked SSL.
  • Figure 2: Comparison of the estimated connectivity of the co-occurrence matrices of GPT and BERT. Details are given in the appendix.
  • Figure 3: Comparison between different conditional sequence lengths in generation evaluation of masked SSL. Shorter conditional sequences suffer from low prediction likelihood.

Theorems & Definitions (12)

  • Theorem 3.1
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • proof
  • Lemma 1.1: Lemma 3.1 in haochen
  • proof
  • Lemma 1.2: Theorem 5.1 in zhang2023identifiable
  • proof
  • Lemma 1.3: Theorem 4.2 in zhang2023generalization
  • ...and 2 more