Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models

Tianjian Li; Haoran Xu; Philipp Koehn; Daniel Khashabi; Kenton Murray

TL;DR

This work proposes Error Norm Truncation (ENT), a robust enhancement to the standard training objective that truncates noisy data. ENT estimates data quality more accurately by considering the distribution of non-target tokens, which previous work often overlooks.

Abstract

Text generation models are notoriously vulnerable to errors in the training data. With massive amounts of web-crawled data becoming widely available, how can we enhance the robustness of models trained on noisy web-crawled text? In our work, we propose Error Norm Truncation (ENT), a robust enhancement to the standard training objective that truncates noisy data. Compared to methods that use only the negative log-likelihood loss to estimate data quality, our method provides a more accurate estimate by considering the distribution of non-target tokens, which previous work often overlooks. Through comprehensive experiments across language modeling, machine translation, and text summarization, we show that equipping text generation models with ENT improves generation quality over standard training and over previous soft and hard truncation methods. Furthermore, we show that our method improves the robustness of models against two of the most detrimental types of noise in machine translation, resulting in an increase of more than 2 BLEU points over the MLE baseline when up to 50% of noise is added to the data.
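To make the contrast between loss and error norm concrete, the following minimal sketch compares the two on three toy next-token distributions that all assign probability 0.1 to the target token. It assumes the error norm is the $\ell_2$ norm of the difference between the predicted distribution and the one-hot target vector; the abstract does not spell out this definition, and the function names, the 10-token vocabulary, and the probabilities are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def nll_loss(probs, target):
    """Negative log-likelihood of the target token."""
    return -np.log(probs[target])

def error_norm(probs, target):
    """L2 norm of (predicted distribution - one-hot target).

    Assumed definition, used here for illustration only.
    """
    one_hot = np.zeros_like(probs)
    one_hot[target] = 1.0
    return np.linalg.norm(probs - one_hot)

target = 0  # index of the ground-truth token in a toy 10-token vocabulary

# Untrained model: probability mass spread uniformly over the vocabulary.
uniform = np.full(10, 0.1)

# High-entropy context: several continuations are equally plausible.
multi = np.array([0.1, 0.3, 0.3, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# Likely data error: the model is confident in a different token.
peaked = np.array([0.1, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

for name, p in [("uniform", uniform), ("multi", multi), ("peaked", peaked)]:
    print(f"{name:8s} loss={nll_loss(p, target):.4f} "
          f"error_norm={error_norm(p, target):.4f}")
```

All three cases have the same loss, since it depends only on the target probability, while the error norm grows from roughly 0.95 (uniform) to 1.04 (several continuations) to 1.27 (confident in a wrong token). Under these assumptions, only a norm-based criterion separates the third case from the other two.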

Paper Structure (18 sections, 13 equations, 6 figures, 8 tables, 1 algorithm)

Introduction
Background and Motivation
Error Norm Truncation
Case Studies
Experiments
Setup
Robustness Results
Train-from-Scratch Results
Fine-Tuning Results
Related Works
Conclusion and Limitations
Equivalence of Loss and Error $\ell_1$ Norm
Additional Related Works
Tasks, Model Sizes, and Hyper-Parameters
Algorithm Pseudocode
...and 3 more sections

Figures (6)

Figure 1: A motivating example of using the error norm for data quality estimation. All three examples have equal loss because they assign the same probability to the ground truth token. The skewness of the distribution of non-target tokens differentiates between the case when the context has high entropy with multiple possible continuations (example 1), the case when the model is at the beginning of training and is incompetent in making a prediction (example 2), and the case when the data is an error (example 3). Truncating high loss removes all three examples, whereas truncating high $\ell_2$ error norm removes only the third, erroneous example.
Figure 2: Examples of natural data noise that harms training. Left: summarization example from the XLSUM (hasan-etal-2021-xl) dataset where details in the summary (highlighted in red) cannot be inferred from the input text, which might cause the model to hallucinate facts when generating a summary. Right: Translation examples from opus-100 (zhang-etal-2020-improving), IWSLT 14 (iwslt-2014) and WMT 17 (bojar-EtAl:2017:WMT1), where details in the translation (highlighted in red) cannot be traced back to the source text (examples 1 and 3) or require the model to perform metric conversion (example 3).
Figure 3: The training dynamics of pre-training GPT2-large on WikiText-103. The plot shows the error norm for the largest 10% of data in each mini-batch. Initially, all error norms are close to 1, indicating the model uniformly assigns tiny probabilities to all target tokens. After the model is warmed up, it begins to detect data noise by assigning large error norms.
Figure 4: Distributions of negative log-likelihood loss and error $\ell_2$ norm of clean and noisy data, evaluated by a pre-trained BART-large model. Error norm clearly distinguishes between clean and noisy data.
Figure 5: Average BLEU results of 4 translation directions En-{De, Fr, It, Es} from the opus-100 dataset with a fraction of sentences being truncated according to loss, error norm, and randomly truncated. Truncating high error norm sentences achieves the best performance at all truncation fractions.
...and 1 more figures
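The comparison described in the Figure 5 caption, where a fraction of the data is dropped according to a quality score, can be sketched as follows. This is a hypothetical illustration, not the paper's algorithm: the function name `truncated_mean_loss`, the choice to drop the top fraction by error norm, and the toy numbers are all assumptions, and the captions do not state whether ENT uses a fraction or a fixed threshold.

```python
import numpy as np

def truncated_mean_loss(losses, error_norms, fraction=0.1):
    """Mean loss after dropping the `fraction` of items with the largest error norm.

    Hypothetical helper for illustration; `fraction` must be in [0, 1).
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    losses = np.asarray(losses, dtype=float)
    error_norms = np.asarray(error_norms, dtype=float)
    if losses.shape != error_norms.shape:
        raise ValueError("losses and error_norms must have the same shape")

    n_drop = int(np.floor(fraction * losses.size))
    if n_drop == 0:
        return losses.mean()

    # Sort indices by ascending error norm and keep all but the last n_drop.
    keep = np.argsort(error_norms)[: losses.size - n_drop]
    return losses[keep].mean()

# Toy batch: the last item has both a high loss and a high error norm.
batch_losses = [1.0, 2.0, 3.0, 10.0]
batch_norms = [0.2, 0.5, 0.3, 1.3]

print(truncated_mean_loss(batch_losses, batch_norms, fraction=0.25))
print(truncated_mean_loss(batch_losses, batch_norms, fraction=0.0))
```

With `fraction=0.25` the single highest-norm item is excluded before averaging; with `fraction=0.0` the function reduces to the ordinary mean loss. Swapping `error_norms` for the losses themselves would give the loss-based truncation baseline that the caption compares against.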
