Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers

Linyuan Gong; Chenyan Xiong; Xiaodong Liu; Payal Bajaj; Yiqing Xie; Alvin Cheung; Jianfeng Gao; Xia Song

TL;DR

This work tackles zero-shot generalization for text-to-text Transformers by introducing METRO-T0, a T5-like encoder-decoder pretrained with ELECTRA-style model-generated signals, namely replaced token detection (RTD) and corrective language modeling (CLM), and then prompt-finetuned on a mixture of NLP tasks. By redesigning the pretraining objectives, masking pattern, and architectural details, METRO-T0 achieves strong prompt-based results on T0 Eval and MMLU with far fewer parameters than large baselines: it competes with much larger models such as GPT-3 ($175$B parameters) and rivals T0-11B with only about $8\%$ of T0-11B's parameters. Ablations show that an all-tokens masked decoding target, encoder-side RTD, and i.i.d. masking are critical for stability and generalization, while METRO-style pretraining yields more efficient learning and more balanced parameter usage. The findings suggest practical, compute-efficient pathways for improving zero-shot capabilities in large language models and provide insights into how model-generated signals affect neural activation and parameter sensitivity.

Abstract

This paper explores the effectiveness of model-generated signals in improving zero-shot generalization of text-to-text Transformers such as T5. We study various designs to pretrain T5 using an auxiliary model to construct more challenging token replacements for the main model to denoise. Key aspects under study include the decoding target, the location of the RTD head, and the masking pattern. Based on these studies, we develop a new model, METRO-T0, which is pretrained using the redesigned ELECTRA-style pretraining strategies and then prompt-finetuned on a mixture of NLP tasks. METRO-T0 outperforms all similar-sized baselines on prompted NLP benchmarks, such as T0 Eval and MMLU, and rivals the state-of-the-art T0-11B model with only 8% of its parameters. Our analysis of the model's neural activation and parameter sensitivity reveals that the effectiveness of METRO-T0 stems from a more balanced contribution of parameters and better utilization of their capacity. The code and model checkpoints are available at https://github.com/gonglinyuan/metro_t0.
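
To make the idea of model-generated signals concrete, the following is a minimal sketch of how a corrupted input, RTD labels, and a CLM target could be built for one sequence. It is an illustration under stated assumptions, not the authors' implementation: the function name make_metro_signals is hypothetical, the auxiliary model is replaced by a uniform random sampler over a toy vocabulary, and the 15% masking rate is an assumed default rather than a value taken from the paper.

```python
import random


def make_metro_signals(tokens, vocab, mask_prob=0.15, seed=0):
    """Build toy ELECTRA-style signals for one token sequence.

    Assumptions (not from the paper): `vocab` and `mask_prob` are
    illustrative, and the auxiliary model is a uniform random sampler.
    """
    rng = random.Random(seed)
    corrupted, rtd_labels = [], []
    for tok in tokens:
        # i.i.d. masking: each position is selected independently.
        if rng.random() < mask_prob:
            # Stand-in for sampling a replacement from the auxiliary model.
            new_tok = rng.choice(vocab)
        else:
            new_tok = tok
        corrupted.append(new_tok)
        # RTD label: 1 if the token now differs from the original, else 0.
        rtd_labels.append(int(new_tok != tok))
    # CLM target: the original, uncorrupted sequence.
    clm_target = list(tokens)
    return corrupted, rtd_labels, clm_target


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["dog", "ran", "under", "a", "rug", "cat"]
corrupted, rtd_labels, clm_target = make_metro_signals(tokens, vocab, mask_prob=0.5)
print(corrupted, rtd_labels, clm_target)
```

In this sketch the main model would read the corrupted sequence, predict the RTD label at each position, and recover the CLM target. A sampled replacement that happens to equal the original token is labeled 0, which matches the usual ELECTRA convention.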

Paper Structure (50 sections, 6 equations, 8 figures, 9 tables)

Introduction
Related Work
Prompt-based learning with language models.
Efficient pretraining using model-generated signals.
Preliminaries
Text-to-Text Transformers
T5 Pretraining.
Text-to-Text Formulation of Downstream Tasks.
Text-to-Text Prompt-Finetuning.
Model-Generated Pretraining Signals
Replaced token detection (RTD)
Corrective language modeling (CLM)
Method
Pretraining Objective Design
Decoding Target.
...and 35 more sections

Figures (8)

Figure 1: Prompt learning results of METRO-T0 versus our T0 baseline and T0$_\textsc{3B}$ by Sanh et al. (2022) on 4 tasks in the T0 Eval benchmark. Each point denotes the accuracy using one prompt template, except that the median accuracy over all templates of T0$_\textsc{3B}$ is indicated by the blue point. The plots of other tasks are in the appendix.
Figure 2: The architecture of METRO-T0 during pretraining using BERT as the auxiliary model to generate signals.
Figure 3: Pretraining behaviors of different designs.
Figure 4: Comparison of the pretraining efficiency of T5 and METRO-T5. Each point shows the performance of a T0++/METRO-T0++ model finetuned from a checkpoint at 500k/1M/2M pretraining steps. The x-axis displays the pretraining wall time, reflecting computational cost, as all models were pretrained in the identical environment.
Figure 5: Per-task performance of T0++ (pretrained for 2M steps) and METRO-T0++ (pretrained for only 500k steps) on T0 Eval. The error bars are calculated using the model's performance across prompt templates.
...and 3 more figures
