SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection

Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar

TL;DR

SpacTor targets the inefficiency of self-supervised pre-training by introducing a hybrid span corruption and replaced token detection objective for encoder-decoder models, augmented with a two-stage curriculum that transitions from the hybrid objective to plain span corruption. The method uses a generator–discriminator setup to create plausible token replacements and to detect replacements, while also denoising corrupted spans, with a final loss blending three terms. Empirically, SpacTor matches standard SC performance while reducing pre-training iterations by about 50% and FLOPs by about 40% on several NLP benchmarks, and scales to larger models with substantial compute savings. The results demonstrate that a staged training schedule mitigates the negative effects of noise from the generator, enabling efficient pre-training without sacrificing downstream performance, and point toward broader applicability to other architectures and data regimes.

Abstract

Pre-training large language models is known to be extremely resource intensive and often inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and replaced token detection (RTD), and (2) a two-stage curriculum that optimizes the hybrid objective over the initial $\tau$ iterations, then transitions to the standard SC loss. We show empirically that the effectiveness of the hybrid objective is tied to the two-stage pre-training schedule, and provide extensive analysis of why this is the case. In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training, while enabling a 50% reduction in pre-training iterations and a 40% reduction in total FLOPs. Alternatively, given the same computing budget, we find that SpacTor results in significantly improved downstream benchmark performance.
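
The two-stage curriculum described in the abstract can be sketched as a simple loss switch. This is a minimal illustration, not the authors' implementation: the function name `spactor_loss`, the weights `lambda_rtd` and `lambda_sc`, and the exact way the three terms are combined are assumptions made here for clarity. The only facts taken from the text are that the hybrid objective blends three terms during the initial $\tau$ iterations and that training then continues with the span corruption loss alone.

```python
def spactor_loss(step, tau, sc_loss, gen_loss, rtd_loss,
                 lambda_rtd=1.0, lambda_sc=1.0):
    """Return the training loss for a given step under a two-stage schedule.

    Assumed (hypothetical) form of the hybrid objective:
        gen_loss + lambda_rtd * rtd_loss + lambda_sc * sc_loss
    The weights and their default values are illustrative, not from the paper.
    """
    if step < tau:
        # Stage 1: hybrid objective (generator + replaced token detection + span corruption).
        return gen_loss + lambda_rtd * rtd_loss + lambda_sc * sc_loss
    # Stage 2: plain span corruption only.
    return sc_loss


# Example: with tau = 120_000, step 0 uses the hybrid loss, step 120_000 uses SC only.
hybrid = spactor_loss(0, 120_000, sc_loss=1.0, gen_loss=2.0, rtd_loss=3.0)
sc_only = spactor_loss(120_000, 120_000, sc_loss=1.0, gen_loss=2.0, rtd_loss=3.0)
```

In a real training loop the generator and replaced-token-detection heads would typically also be dropped once the second stage begins, since their losses no longer contribute; that detail is omitted from this sketch.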

Paper Structure (19 sections, 11 equations, 5 figures, 17 tables)

Figures (5)

  • Figure 1: The SpacTor pre-training objective in the first stage. In step (1), the original text is randomly corrupted with span corruption (marked as [S0], [S1], etc.) and then token-level random masking (marked as [M]). A small auxiliary generator model $G$ is trained to recover [M] only. The resulting text is then fed into the T5 discriminator $D$, whose encoder component learns to predict at every position whether the token is a replaced one, while its decoder component learns to fill in the ground truth token as in standard span corruption.
  • Figure 2: SpacTor($\tau$) performance on SuperGLUE, SQuAD and CNN/DailyMail with respect to pre-training FLOPs. Here, we include SpacTor(250K) and SpacTor(120K), where the second pre-training stage (using the span corruption objective only) starts at 250K and 120K training steps respectively. The plots for the remaining tasks are presented in Appendix \ref{app:score-flops-plot}.
  • Figure 3: Average score on downstream tasks ($y$-axis) when continuously fine-tuning along the pre-training checkpoints ($x$-axis). The error band illustrates the min-max range over 5 independent runs.
  • Figure 4: (Left) Validation loss curve for baseline and SpacTor($\infty$). (Right) Validation cross-entropy loss differences between baseline and SpacTor($\infty$) evaluated with encoder input $X_{\mathrm{c}}$. The dashed line is a linear regression fit to the data starting at iteration 120K.
  • Figure 5: SpacTor performance on GLUE, Rainbow, BBH and MMLU with respect to pre-training FLOPs for the T5-Base model.
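
The first-stage data flow in the Figure 1 caption (span corruption, then token-level masking, then generator replacement, then per-position replaced/original labels) can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's pipeline: all function names are hypothetical, span and mask positions are passed in by hand rather than sampled, and the generator $G$ is stood in for by a fixed dictionary of predictions. Following the usual replaced-token-detection convention, a position is labeled as replaced only when the filled-in token differs from the original; that convention is assumed here.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel [S0], [S1], ...

    Returns the corrupted encoder input and the decoder target that
    lists each sentinel followed by the tokens it hides.
    """
    corrupted, target = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"[S{i}]"
        corrupted.extend(tokens[prev:start])
        corrupted.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        prev = end
    corrupted.extend(tokens[prev:])
    return corrupted, target


def mask_tokens(corrupted, mask_positions):
    """Apply token-level masking: put [M] at the chosen positions."""
    return ["[M]" if i in mask_positions else t for i, t in enumerate(corrupted)]


def fill_masks(masked, predictions):
    """Stand-in for the generator G: fill each [M] from a position -> token dict."""
    return [predictions[i] if t == "[M]" else t for i, t in enumerate(masked)]


def rtd_labels(corrupted, replaced):
    """Label 1 where the token differs from the span-corrupted input, else 0."""
    return [int(a != b) for a, b in zip(corrupted, replaced)]


tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, target = span_corrupt(tokens, spans=[(1, 3), (6, 7)])
masked = mask_tokens(corrupted, mask_positions={3})
replaced = fill_masks(masked, predictions={3: "runs"})  # hypothetical generator output
labels = rtd_labels(corrupted, replaced)
```

Here `replaced` plays the role of the discriminator's encoder input, `labels` is the encoder-side detection target, and `target` is the decoder-side span corruption target.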