SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection

Ke Ye; Heinrich Jiang; Afshin Rostamizadeh; Ayan Chakrabarti; Giulia DeSalvo; Jean-François Kagy; Lazaros Karydas; Gui Citovsky; Sanjiv Kumar

TL;DR

SpacTor targets the inefficiency of self-supervised pre-training by introducing a hybrid span corruption (SC) and replaced token detection (RTD) objective for encoder-decoder models, augmented with a two-stage curriculum that transitions from the hybrid objective to plain span corruption. The method uses a generator–discriminator setup in which the generator creates plausible token replacements and the discriminator detects them while also denoising corrupted spans; the final loss blends three terms. Empirically, SpacTor matches standard SC performance while reducing pre-training iterations by about 50% and FLOPs by about 40% on several NLP benchmarks, and scales to larger models with substantial compute savings. The results demonstrate that a staged training schedule mitigates the negative effects of noise from the generator, enabling efficient pre-training without sacrificing downstream performance, and point toward broader applicability to other architectures and data regimes.
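The two headline savings can be related with a line of arithmetic. The sketch below is a back-of-the-envelope reading of the rounded figures quoted above, not a calculation taken from the paper; the function name `relative_cost_per_iteration` is hypothetical. If a run uses about half the iterations but about 60% of the FLOPs, its average iteration costs roughly 1.2 times a baseline SC iteration, which is consistent with the first stage carrying extra work for the generator and the detection loss.

```python
def relative_cost_per_iteration(iteration_ratio: float, flops_ratio: float) -> float:
    """Average per-iteration cost of a run, relative to the baseline's per-iteration cost.

    iteration_ratio: iterations used / baseline iterations (e.g. 0.5 for a 50% reduction)
    flops_ratio:     total FLOPs used / baseline total FLOPs (e.g. 0.6 for a 40% reduction)
    """
    if iteration_ratio <= 0:
        raise ValueError("iteration_ratio must be positive")
    return flops_ratio / iteration_ratio


# Rounded figures from the summary: about 50% fewer iterations, about 40% fewer FLOPs.
ratio = relative_cost_per_iteration(iteration_ratio=0.5, flops_ratio=0.6)
print(f"average cost per iteration: {ratio:.2f}x baseline")
```

This is an average over the whole run: iterations in the second stage use plain span corruption and should cost about the same as baseline iterations, so the overhead is concentrated in the first stage.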

Abstract

Pre-training large language models is known to be extremely resource intensive and oftentimes inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and token replacement detection (RTD), and (2) a two-stage curriculum that optimizes the hybrid objective over the initial $\tau$ iterations, then transitions to standard SC loss. We show empirically that the effectiveness of the hybrid objective is tied to the two-stage pre-training schedule, and provide extensive analysis on why this is the case. In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training, while enabling a 50% reduction in pre-training iterations and 40% reduction in total FLOPs. Alternatively, given the same amount of computing budget, we find that SpacTor results in significantly improved downstream benchmark performance.
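The two-stage curriculum described in the abstract can be illustrated with a small loss-selection function. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `StepLosses` and `spactor_loss`, the weighted-sum form of the hybrid loss, and the weights `lambda_rtd` and `lambda_sc` are all hypothetical. The abstract only states that the hybrid objective is used for the first $\tau$ iterations and standard SC loss afterwards.

```python
from dataclasses import dataclass


@dataclass
class StepLosses:
    """Per-step loss values, assumed to be computed elsewhere by the model."""
    generator: float        # generator loss for recovering masked tokens
    discriminator: float    # replaced token detection (RTD) loss
    span_corruption: float  # span corruption (SC) denoising loss


def spactor_loss(step: int, tau: int, losses: StepLosses,
                 lambda_rtd: float = 1.0, lambda_sc: float = 1.0) -> float:
    """Return the training loss for this step under a two-stage schedule.

    Stage 1 (step < tau): hybrid objective, here assumed to be a weighted sum
    of the three terms. The weights are illustrative placeholders.
    Stage 2 (step >= tau): standard span corruption loss only.
    """
    if step < tau:
        return (losses.generator
                + lambda_rtd * losses.discriminator
                + lambda_sc * losses.span_corruption)
    return losses.span_corruption


losses = StepLosses(generator=0.5, discriminator=0.25, span_corruption=2.0)
print(spactor_loss(step=0, tau=10, losses=losses))   # stage 1: hybrid objective
print(spactor_loss(step=10, tau=10, losses=losses))  # stage 2: SC only
```

In a real training loop the second stage would also stop feeding generator-replaced tokens to the encoder, so the model input reverts to ordinary span-corrupted text; the function above only captures which loss terms are active.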

Paper Structure (19 sections, 11 equations, 5 figures, 17 tables)

Introduction
SpacTor Method
The Hybrid Pre-training Objective
Two-staged Pre-training
Experiments
Setup
Results
Single stage pre-training
With continued pre-training
Efficiency analysis
Large models
Related Work
Conclusion and Future Work
Training Hyperparameters
Pre-training Hyperparameters
...and 4 more sections

Figures (5)

Figure 1: The SpacTor pre-training objective in the first stage. In step (1), the original text is randomly corrupted with span corruption (marked as [S0], [S1], etc.) and then token-level random masking (marked as [M]). A small auxiliary generator model $G$ is trained to recover [M] only. The resulting text is then fed into the T5 discriminator $D$, whose encoder component learns to predict at every position whether the token is a replaced one, while its decoder component learns to fill in the ground truth token as in standard span corruption.
Figure 2: SpacTor($\tau$) performances on SuperGLUE, SQuAD and CNN/DailyMail with respect to pre-training FLOPs. Here, we include SpacTor(250K) and SpacTor(120K) where the second pre-training stage (using the span corruption objective only) starts at 250K and 120K training steps respectively. The plots for the remaining tasks are presented in Appendix \ref{app:score-flops-plot}.
Figure 3: Average score on downstream tasks ($y$-axis) when continuously fine-tuning along the pre-training checkpoints ($x$-axis). The error band illustrates the min-max range over 5 independent runs.
Figure 4: (Left) Validation loss curve for baseline and SpacTor($\infty$). (Right) Validation cross-entropy loss differences between baseline and SpacTor($\infty$) evaluated with encoder input $X_{\mathrm{c}}$. The dashed line is a linear regression fit to the data starting at iteration 120K.
Figure 5: SpacTor performances on GLUE, Rainbow, BBH and MMLU with respect to pre-training FLOPs for T5-Base model.
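
The data flow described in the Figure 1 caption can be traced on a toy sentence. The sketch below is illustrative only: the helper names `span_corrupt` and `mask_and_replace` are hypothetical, the span and mask positions are fixed by hand rather than sampled, the generator is a stand-in function rather than a trained model, and the decoder target omits the closing sentinel that T5-style targets usually carry. It also assumes that a position counts as "replaced" only when the generator's token differs from the original.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel [S0], [S1], ...

    Returns the corrupted encoder input and the decoder target that lists
    each sentinel followed by the tokens it hides.
    """
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"[S{i}]"
        corrupted.extend(tokens[cursor:start])
        corrupted.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    corrupted.extend(tokens[cursor:])
    return corrupted, target


def mask_and_replace(corrupted, mask_positions, generator):
    """Mask non-sentinel positions with [M], fill them with generator output,
    and build per-position replaced-token labels (1 = replaced, 0 = original)."""
    masked = ["[M]" if i in mask_positions else tok
              for i, tok in enumerate(corrupted)]
    filled = list(corrupted)
    for i in mask_positions:
        filled[i] = generator(masked, i)  # generator predicts a token for [M]
    labels = [int(new != old) for new, old in zip(filled, corrupted)]
    return masked, filled, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, target = span_corrupt(tokens, spans=[(1, 3), (6, 7)])

# Stand-in generator that always proposes "runs" (a trained model would go here).
masked, filled, labels = mask_and_replace(
    corrupted, mask_positions={3}, generator=lambda seq, i: "runs")

print(masked)   # generator input, with [M] at the masked position
print(filled)   # discriminator encoder input, with the generator's replacement
print(labels)   # encoder RTD targets
print(target)   # decoder span corruption targets
```

Here the encoder sees `filled` and is trained against `labels`, while the decoder is trained against `target`, mirroring the split between the two components of the discriminator in the caption.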
