Layer-wise Regularized Dropout for Neural Language Models

Shiwen Ni, Min Yang, Ruifeng Xu, Chengming Li, Xiping Hu

TL;DR

LR-Drop is a layer-wise regularized dropout method for Transformer-based language models. It passes each input through two dropout-induced sub-models and enforces consistency between their hidden states, multi-head attention matrices, and output distributions. The training objective adds hidden-state, attention, and output regularization terms to the standard cross-entropy loss, which improves generalization without adding parameters. Across 15 datasets spanning NLU, NMT, and abstractive summarization, LR-Drop consistently outperforms baselines and RD-R-Drop variants, reaching state-of-the-art or near state-of-the-art performance and remaining robust across training-set sizes and model types. Loss-landscape visualizations also indicate that LR-Drop converges to flatter minima, which is consistent with better generalization.
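
As a rough sketch of how the four terms described above might combine, the total objective can be written as a weighted sum. The symbols $\mathcal{L}_{\mathrm{HR}}$, $\mathcal{L}_{\mathrm{MHAR}}$, $\mathcal{L}_{\mathrm{OR}}$ and the weights $\alpha$, $\beta$, $\gamma$ are illustrative assumptions here, not notation confirmed by this summary:

```latex
% Assumed form: cross-entropy plus three weighted consistency terms
% (hidden states, multi-head attention, output distribution).
\mathcal{L}
  = \mathcal{L}_{\mathrm{CE}}
  + \alpha \, \mathcal{L}_{\mathrm{HR}}
  + \beta  \, \mathcal{L}_{\mathrm{MHAR}}
  + \gamma \, \mathcal{L}_{\mathrm{OR}}
```

Each regularizer compares the same quantity across the two dropout sub-models, so all three vanish when the two forward passes agree exactly, and the objective reduces to plain cross-entropy.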

Abstract

Dropout is already an indispensable regularization technique for the pre-trained neural language models that are popular today. To address the inconsistency between training and inference caused by the randomness of dropout, some studies use consistency training to regularize dropout at the output layer. In this paper, we propose a novel Layer-wise Regularized Dropout (LR-Drop), which is specially designed for Transformer-based language models. Specifically, LR-Drop regularizes each Transformer layer using the consistency training strategy. Each training sample passes through two siamese sub-models sampled by dropout, and LR-Drop then forces the hidden states, multi-head attention matrices, and output distributions of the two sub-models to be consistent. The proposed LR-Drop can be regarded as a "self-distillation" framework, in which each sub-model generated by dropout acts as both "teacher" and "student" to the other. Through extensive experiments on 8 natural language understanding datasets, 6 neural machine translation datasets, and 1 abstractive summarization dataset (15 datasets in total), we show that LR-Drop achieves superior performance, including state-of-the-art results.
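
The training procedure described in the abstract can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration: the toy `forward` function is only a stand-in for a Transformer, the names `forward`, `lr_drop_loss`, `alpha`, `beta`, and `gamma` are hypothetical, and the choice of mean squared error for hidden states and attention and of symmetric KL divergence for the outputs is one plausible option rather than the paper's confirmed formulation.

```python
import math
import random

def dropout(vec, p, rng):
    """Inverted dropout (requires p < 1): zero units with prob. p, rescale the rest."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in vec]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def forward(x, weights, w_out, p, rng):
    """Toy stand-in for a Transformer (assumption, not the real architecture).

    Returns per-layer hidden states, per-layer attention-like rows,
    and the output distribution for one dropout-sampled sub-model.
    """
    hiddens, attns = [], []
    h = x
    for W in weights:
        attn = softmax(dropout(h, p, rng))  # stand-in for one attention row
        h = [math.tanh(sum(w * a for w, a in zip(row, attn))) for row in W]
        h = dropout(h, p, rng)
        hiddens.append(h)
        attns.append(attn)
    logits = [sum(w * v for w, v in zip(row, h)) for row in w_out]
    return hiddens, attns, softmax(logits)

def lr_drop_loss(pass1, pass2, label, alpha=1.0, beta=1.0, gamma=1.0):
    """Cross-entropy plus layer-wise and output consistency terms (assumed forms)."""
    (h1, a1, p1), (h2, a2, p2) = pass1, pass2
    # Cross-entropy averaged over the two sub-models.
    ce = -0.5 * (math.log(p1[label]) + math.log(p2[label]))
    # Layer-wise terms: average distance between matching layers.
    hidden_reg = sum(mse(u, v) for u, v in zip(h1, h2)) / len(h1)
    attn_reg = sum(mse(u, v) for u, v in zip(a1, a2)) / len(a1)
    # Output term: symmetric KL between the two output distributions.
    out_reg = 0.5 * (kl(p1, p2) + kl(p2, p1))
    return ce + alpha * hidden_reg + beta * attn_reg + gamma * out_reg

# Usage: the same input goes through the model twice; each call draws
# fresh dropout masks, which yields two different sub-models.
rng = random.Random(0)
weights = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)]
           for _ in range(2)]
w_out = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
x = [0.5, -0.2, 0.1, 0.9]
pass1 = forward(x, weights, w_out, 0.1, rng)
pass2 = forward(x, weights, w_out, 0.1, rng)
loss = lr_drop_loss(pass1, pass2, label=1)
```

Because both passes share the same weights, the consistency terms add no parameters; the extra cost is the second forward pass. With dropout disabled (`p = 0`), the two passes coincide and the loss in this sketch reduces to the cross-entropy term alone.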

Paper Structure

This paper contains 20 sections, 9 equations, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: The proposed LR-Drop for regularizing a Transformer-based PLM. Left: each input passes through the model twice, each time through a different sub-model produced by dropout, yielding two distributions $P_1$ and $P_2$. Right: the regularization of a single Transformer layer, comprising hidden-state regularization and MHA regularization.
  • Figure 2: 2D (left) and 3D (right) visualizations of the loss minima found by BERT-base with standard training (ST) and with LR-Drop on the SST-2 dataset.