Pretraining-Based Natural Language Generation for Text Summarization

Haoyu Zhang, Jianjun Xu, Ji Wang

TL;DR

This work introduces a BERT-based encoder-decoder architecture for abstractive text summarization that employs a novel two-stage decoding process: a draft stage generated with a left-context-only decoder, followed by a refine stage that masks each draft word and predicts refined tokens using BERT-facilitated context. The model incorporates a copy mechanism and a mixed objective that blends maximum likelihood with reinforcement learning to optimize ROUGE-based rewards, training end-to-end with shared decoder parameters. Evaluated on CNN/Daily Mail and NYT50, the approach achieves state-of-the-art results, demonstrating the utility of integrating pre-trained contextual encoders into both encoder and decoder roles for generation. The findings suggest that leveraging bi-directional context via a refine mechanism can significantly improve the fluency and informativeness of generated summaries, with potential applicability to broader natural language generation tasks.

Abstract

In this paper, we propose a novel pretraining-based encoder-decoder framework that generates the output sequence from the input sequence in two stages. The encoder uses BERT to encode the input sequence into context representations. The decoder works in two stages. In the first stage, a Transformer-based decoder generates a draft output sequence. In the second stage, we mask each word of the draft sequence and feed it to BERT; then, combining the input sequence with the draft representation generated by BERT, a Transformer-based decoder predicts the refined word for each masked position. To the best of our knowledge, our approach is the first to apply BERT to text generation tasks. As a first step in this direction, we evaluate the proposed method on text summarization. Experimental results show that our model achieves a new state of the art on both the CNN/Daily Mail and New York Times datasets.
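
The masking step of the second stage can be illustrated with a minimal Python sketch. It assumes the draft is masked one position at a time, which is one reading of "mask each word of the draft sequence". The names `masked_variants`, `refine`, and `predict_word` are hypothetical and do not come from the paper; `predict_word` stands in for the combination of BERT and the Transformer-based decoder conditioned on the input sequence, which is not implemented here.

```python
from typing import Callable, List

MASK = "[MASK]"  # mask token string used by BERT vocabularies


def masked_variants(draft: List[str]) -> List[List[str]]:
    """Return one copy of the draft per position, with that position masked."""
    return [draft[:i] + [MASK] + draft[i + 1:] for i in range(len(draft))]


def refine(draft: List[str],
           predict_word: Callable[[List[str], int], str]) -> List[str]:
    """Predict a refined word for every masked position of the draft.

    predict_word(masked_draft, position) is a hypothetical stand-in for the
    model: it sees the draft with one word hidden and returns a word for it.
    """
    return [predict_word(masked, i)
            for i, masked in enumerate(masked_variants(draft))]


if __name__ == "__main__":
    draft = ["the", "cat", "sat"]
    for variant in masked_variants(draft):
        print(" ".join(variant))
    # Toy predictor (assumption for illustration): label each position.
    print(refine(draft, lambda masked, i: f"word{i}"))
```

Because every prediction is made from the draft with only one word hidden, the predictor has access to context on both sides of the masked position, whereas the draft stage sees only the words to the left.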

Paper Structure

This paper contains 25 sections, 11 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Model overview. N denotes the number of decoder layers and L denotes the summary length.
  • Figure 2: Average ROUGE-L improvement on CNN/Daily Mail test set samples, grouped by reference (gold) summary length.