Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining

Qi Zhang; Tianqi Du; Haotian Huang; Yifei Wang; Yisen Wang

TL;DR

This work delivers the first theoretical comparison between autoregressive and masked generative SSL, framing both within a unified matrix-decomposition perspective on a co-occurrence matrix. It proves downstream guarantees for linear classification tied to the singular values of the co-occurrence matrix and identifies that masked SSL yields stronger cross-sample connectivity, while autoregressive SSL better supports generation due to objective-length alignment. To leverage both strengths, the authors propose diversity-enhanced autoregressive and variable-length masked objectives, and demonstrate substantial gains on language (GLUE) and vision (ImageNet) benchmarks as well as generation metrics (perplexity, reconstruction loss). The findings offer principled guidance for designing SSL objectives that improve classification and generation in tandem, with practical improvements across domains and clear avenues for future refinement.

Abstract

In recent years, the rise of generative self-supervised learning (SSL) paradigms has exhibited impressive performance across visual, language, and multi-modal domains. While the varied designs of generative SSL objectives lead to distinct properties in downstream tasks, a theoretical understanding of these differences remains largely unexplored. In this paper, we establish the first theoretical comparisons between two leading generative SSL paradigms: autoregressive SSL and masked SSL. Through establishing theoretical frameworks, we elucidate the strengths and limitations of autoregressive and masked SSL within the primary evaluation tasks of classification and content generation. Our findings demonstrate that in classification tasks, the flexibility of targeted tokens in masked SSL fosters more inter-sample connections compared to the fixed position of target tokens in autoregressive SSL, which yields superior clustering performance. In content generation tasks, the misalignment between the flexible lengths of test samples and the fixed length of unmasked texts in masked SSL (vs. flexible lengths of conditional texts in autoregressive SSL) hinders its generation performance. To leverage each other's strengths and mitigate weaknesses, we propose diversity-enhanced autoregressive and variable-length masked objectives, which substantially improve the classification performance of autoregressive SSL and the generation performance of masked SSL. Code is available at https://github.com/PKU-ML/LookAheadLookAround.
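The contrast drawn in the abstract between fixed and flexible target positions can be illustrated with a toy sketch. It is not the authors' code: the function names, the `[MASK]` placeholder, and the fixed `mask_ratio` are assumptions made for illustration only.

```python
import random


def autoregressive_pairs(tokens):
    """Build (context, target) pairs for next-token prediction.

    The target always sits immediately after the context, so its position
    is fixed relative to the context, while the context length varies.
    """
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


def masked_pairs(tokens, mask_ratio=0.5, seed=0, mask_token="[MASK]"):
    """Build (context, position, target) triples for masked prediction.

    Target positions are sampled, so they can fall anywhere in the sequence,
    but a fixed mask ratio keeps the number of unmasked tokens constant.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked_set = set(masked_positions)
    context = tuple(
        mask_token if i in masked_set else tok for i, tok in enumerate(tokens)
    )
    return [(context, pos, tokens[pos]) for pos in masked_positions]


if __name__ == "__main__":
    sentence = ["the", "cat", "sat", "down"]
    for context, target in autoregressive_pairs(sentence):
        print("AR    ", context, "->", target)
    for context, pos, target in masked_pairs(sentence):
        print("masked", context, "-> position", pos, "=", target)
```

In this sketch the autoregressive contexts have lengths 1 through 3, whereas every masked context keeps exactly two visible tokens. This mirrors the length mismatch the abstract associates with generation in masked SSL.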

Paper Structure (26 sections, 7 theorems, 69 equations, 3 figures, 9 tables)

Introduction
Related Work
Mathematical Formulation
Pretraining Objectives
Revisiting Objectives from a Matrix Perspective
Downstream Tasks
A Theoretical Comparison between Autoregressive and Masked SSL
Generalization on Linear Classification
Downstream Classification Guarantees of Generative SSL
Comparing Autoregressive and Masked SSL
Generalization on Content Generation
Discussion
Experiments
Diversity-enhanced Autoregressive Objective Improves Classification Ability
Variable-length Masked Objective Improves Generation Ability
...and 11 more sections

Key Result

Theorem 3.1

Let $\bar{A}$ be the normalized co-occurrence matrix, i.e., $\bar{A}_{X_i,X_i^+} = \frac{A_{X_i,X_i^+}}{\sqrt{P_C(X_i)P_G(X_i^+)}}$. Then the pretraining objective can be written as a matrix decomposition of $\bar{A}$ in terms of $F$ and $W'$, where the $X_i$-th row of $F$ and the $X_i^+$-th row of $W'$ represent the encoded features and the token embeddings, respectively, i.e., $F_{X_i} = \sqrt{P_C(X_i)}f(X_i)^\top$, $W'_{X_i^+} = \sqrt{P_G(X_i^+)}W_{X_i^+}$.
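To make the definitions in Theorem 3.1 concrete, the following is a minimal numerical sketch, not the authors' implementation. It assumes that $A$ is a joint distribution over (context, target) pairs, that $P_C$ and $P_G$ are its marginals, and that the decomposition is measured by the squared Frobenius error $\|\bar{A} - FW'^\top\|_F^2$. The toy sizes and the truncated-SVD solution are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, n_tgt, k = 6, 5, 3  # toy sizes and feature dimension (assumptions)

# Hypothetical joint distribution A over (context X_i, target X_i^+) pairs.
A = rng.random((n_ctx, n_tgt))
A /= A.sum()

# Marginals, assumed here to play the roles of P_C and P_G.
P_C = A.sum(axis=1)
P_G = A.sum(axis=0)

# Normalized co-occurrence matrix: A_bar[i, j] = A[i, j] / sqrt(P_C[i] * P_G[j]).
A_bar = A / np.sqrt(np.outer(P_C, P_G))

# Best rank-k factorization in Frobenius norm (Eckart-Young) via truncated SVD.
U, S, Vt = np.linalg.svd(A_bar, full_matrices=False)
F = U[:, :k] * np.sqrt(S[:k])          # rows: sqrt(P_C(X_i)) * f(X_i)^T
W_prime = Vt[:k].T * np.sqrt(S[:k])    # rows: sqrt(P_G(X_i^+)) * W_{X_i^+}

# Undo the scaling to recover per-sample features and token embeddings.
f = F / np.sqrt(P_C)[:, None]
W = W_prime / np.sqrt(P_G)[:, None]

# The approximation error equals the energy in the discarded singular values.
residual = np.linalg.norm(A_bar - F @ W_prime.T) ** 2
tail = float(np.sum(S[k:] ** 2))

if __name__ == "__main__":
    print("singular values:", np.round(S, 4))
    print("residual:", residual, "tail energy:", tail)
```

Under these assumptions the largest singular value of $\bar{A}$ is 1, and the error of the rank-$k$ factorization is governed by the remaining singular values. This is the kind of quantity the TL;DR ties the downstream classification guarantees to.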

Figures (3)

Figure 1: Illustration of two primary paradigms in generative SSL: autoregressive SSL and masked SSL.
Figure 2: Comparison of the estimated connectivity of the co-occurrence matrices of GPT and BERT. Details are given in the appendix.
Figure 3: Comparison between different conditional sequence lengths in generation evaluation of masked SSL. Shorter conditional sequences suffer from low prediction likelihood.

Theorems & Definitions (12)

Theorem 3.1
Theorem 4.1
Theorem 4.2
Theorem 4.3
proof
Lemma 1.1: Lemma 3.1 in haochen
proof
Lemma 1.2: Theorem 5.1 in zhang2023identifiable
proof
Lemma 1.3: Theorem 4.2 in zhang2023generalization
...and 2 more
