E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation

Qihuang Zhong; Liang Ding; Juhua Liu; Bo Du; Dacheng Tao

TL;DR

The paper tackles the under-exploited encoder in seq2seq pretrained language models by introducing encoding-enhanced seq2seq pretraining (E2S2). It adds two encoder-side self-supervised objectives, a local denoising objective $\mathcal{L}_{de}$ and a global contrastive objective $\mathcal{L}_{cl}$, to the standard reconstruction losses, forming the overall objective $\mathcal{L}_{all} = \mathcal{L}^*_{nll} + \mathcal{L}_{nll} + \lambda_{de}\mathcal{L}_{de} + \lambda_{cl}\mathcal{L}_{cl}$. Empirical results on GLUE, CoLA, CoNLL2014, CNN/DM, XSum, SAMSum, and various dialogue datasets show consistent improvements over vanilla seq2seq pretraining (e.g., +1.1% average on GLUE, +2.3% on CoLA, +1.75% $F_{0.5}$ on CoNLL2014) and demonstrate compatibility with backbones such as BART and T5. Analyses indicate that E2S2 enhances encoder representations across surface, syntactic, and semantic aspects, which explains the downstream gains in both understanding and generation tasks. The work suggests a general, model-agnostic path for improving seq2seq pretraining via encoder-focused self-supervision and motivates future exploration of automated prompts and scaling to larger models.
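As a minimal sketch of how the four terms of the overall objective combine, the snippet below computes the weighted sum from scalar loss values. The function name `e2s2_total_loss` and the numbers in the usage line are illustrative assumptions, not values from the paper; in real training each term would be a differentiable tensor rather than a float.

```python
def e2s2_total_loss(nll_star, nll, l_de, l_cl, *, lambda_de, lambda_cl):
    """Weighted sum L_all = L*_nll + L_nll + lambda_de * L_de + lambda_cl * L_cl.

    nll_star, nll : the two reconstruction (negative log-likelihood) terms
    l_de, l_cl    : encoder-side denoising and contrastive terms
    lambda_de, lambda_cl : weighting coefficients (no defaults on purpose,
                           since the paper's settings are not given here)
    """
    return nll_star + nll + lambda_de * l_de + lambda_cl * l_cl


# Illustrative placeholder values only.
total = e2s2_total_loss(2.0, 1.5, 0.8, 0.4, lambda_de=0.5, lambda_cl=0.25)
print(total)
```

Keeping the coefficients as required keyword arguments makes it harder to silently train with a weighting that was never chosen deliberately.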

Abstract

Sequence-to-sequence (seq2seq) learning is a popular approach to large-scale language model pretraining. However, prior seq2seq pretraining models generally focus on reconstructive objectives on the decoder side and neglect the effect of encoder-side supervision, which we argue may lead to sub-optimal performance. To verify our hypothesis, we first empirically study the functionalities of the encoder and decoder in seq2seq pretrained language models, and find that the encoder plays an important but under-exploited role compared with the decoder in terms of downstream performance and neuron activation. Therefore, we propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2, which improves seq2seq models by integrating more efficient self-supervised information into the encoders. Specifically, E2S2 adopts two self-supervised objectives on the encoder side from two aspects: 1) locally denoising the corrupted sentence (denoising objective); and 2) globally learning better sentence representations (contrastive objective). With the help of both objectives, the encoder can effectively distinguish the noise tokens and capture high-level (i.e., syntactic and semantic) knowledge, thus strengthening the ability of the seq2seq model to perform conditional generation accurately. On a wide range of downstream natural language understanding and generation tasks, E2S2 consistently improves the performance of its powerful backbone models, e.g., BART and T5. For example, with the BART backbone, we achieve a +1.1% average gain on the General Language Understanding Evaluation (GLUE) benchmark and a +1.75% $F_{0.5}$ score improvement on the CoNLL2014 dataset. We also provide in-depth analyses to show that the improvement stems from better linguistic representation. We hope that our work will foster future self-supervision research on seq2seq language model pretraining.
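To make the "globally learning better sentence representations" idea concrete, here is a rough sketch of a generic InfoNCE-style contrastive loss over a mini-batch of sentence embeddings. This is an assumption about the general family of losses, not the paper's exact formulation: the function names, the cosine similarity, and the temperature value are all illustrative. Each sentence's two views (for example, embeddings of two differently corrupted inputs) form a positive pair, and the other sentences in the batch act as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.05):
    """Mean InfoNCE loss; view_a[i] and view_b[i] embed the same sentence.

    The temperature default is an illustrative assumption.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(aligned, aligned) < info_nce(aligned, swapped))
```

The loss is small when each embedding is closest to its own counterpart and large when it is closer to another sentence in the batch, which is the behaviour a sentence-level contrastive objective is meant to encourage.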

Paper Structure (37 sections, 4 equations, 5 figures, 10 tables)

Introduction
Related Works
Pretrained Language Models
Self-supervision Learning in PLMs
Methodology
Background
Sequence-to-Sequence Pretraining
Contrastive Learning
Prompt-based Learning
Encoding-Enhanced Sequence-to-Sequence Pretraining
Locally Denoising the Perturbed Sentences
Globally Learning Better Sentence Embeddings
Overall pretraining objective
Experiments
Tasks and Datasets
...and 22 more sections

Figures (5)

Figure 1: Left: comparison of the downstream performance decrease when removing encoder and decoder layers, respectively, of a seq2seq PLM (i.e., BART; Lewis et al., 2020). Right: the average rate of activated neurons in the FFN layers of the encoder and decoder, where a higher rate indicates that the FFN is trained more sufficiently. Takeaway: the encoder plays a key role in the seq2seq framework, but its training is less sufficient than that of the decoder.
Figure 2: Schematic comparison of our E2S2 with the vanilla seq2seq pretraining scheme. In general, (a) is the original seq2seq pretraining scheme and (b) is our proposed E2S2 strategy, which integrates two self-supervised objectives on the encoder side, i.e., the denoising objective $\mathcal{L}_{de}$ and the contrastive objective $\mathcal{L}_{cl}$. Notably, $\tilde{x}$ is the sentence corrupted with text infilling noise, $\tilde{x}^*$ is corrupted with extra noise (e.g., shuffling and random replacement), $n$ is the number of sentences in a mini-batch, and $\theta_{enc}$ and $\theta_{all}$ denote the parameters of the encoder and the full model, respectively. Specifically, (c) and (d) provide a more detailed illustration of each objective. In (c), the token sequence is obtained by corrupting the original input tokens $\{t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7\}$, where the blue boxes denote the shuffled tokens and the yellow box denotes the randomly replaced token. $m$ is the length of the token sequence and $c$ denotes the noise-type classes (1 for shuffle, 2 for random replacement, and 0 for others). $y$ and $p$ are the ground truths and predictions of the noise type, respectively.
Figure 3: Illustration of pretraining details. The left y-axis is the overall training loss, while the right y-axis is the average performance on the dev sets of GLUE. The loss curves of the contrastive and denoising objectives are shown in the inset figure.
Figure 4: Results on GEC task (CoNLL2014).
Figure 5: Parameter analyses of $\lambda_{de}$ and $\lambda_{cl}$. We train the E2S2-BART_JoBa models with different coefficient combinations and evaluate them on the dev sets of the GLUE benchmark.
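The token-level noise labels described in the Figure 2 caption (0 for others, 1 for shuffle, 2 for random replacement) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' corruption pipeline: the helper names `corrupt` and `denoising_loss`, the single shuffled span, the single replaced position, and the rule that only tokens which actually moved are labelled as shuffled are all hypothetical choices.

```python
import math
import random

OTHER, SHUFFLED, REPLACED = 0, 1, 2  # class ids as listed in the Figure 2 caption


def corrupt(tokens, vocab, span, replace_at, seed=0):
    """Shuffle tokens inside span=[start, end) and replace one token outside it.

    Returns the noisy tokens and one noise-type label per position.
    """
    start, end = span
    if start <= replace_at < end:
        raise ValueError("replace_at must lie outside the shuffled span")
    rng = random.Random(seed)
    noisy = list(tokens)
    labels = [OTHER] * len(tokens)

    window = noisy[start:end]
    rng.shuffle(window)
    noisy[start:end] = window
    for i in range(start, end):
        if noisy[i] != tokens[i]:  # label only positions whose token changed
            labels[i] = SHUFFLED

    candidates = [w for w in vocab if w != tokens[replace_at]]
    noisy[replace_at] = rng.choice(candidates)
    labels[replace_at] = REPLACED
    return noisy, labels


def denoising_loss(probs, labels):
    """Mean cross-entropy over the m tokens; probs[i] is a distribution over 3 classes."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


tokens = [f"t{i}" for i in range(8)]
noisy, labels = corrupt(tokens, tokens + ["t8"], span=(1, 4), replace_at=6)
print(noisy, labels)
```

In a real model the per-token distributions `probs` would come from a classification head on top of the encoder outputs; here they are plain lists so the loss can be checked by hand.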
