Intelligent Learning Rate Distribution to reduce Catastrophic Forgetting in Transformers

Philip Kenneweg, Alexander Schulz, Sarah Schröder, Barbara Hammer

TL;DR

The paper tackles catastrophic forgetting during fine-tuning of transformer-based NLP models by questioning the traditional flat learning-rate approach. It introduces $BERTcL$, an AutoML-driven method that optimizes layerwise learning-rate distributions via Bayesian optimization, and extends it to a combined distribution to generalize to unseen data, using a two-stage learning-rate search over dataset pairs and a geometric-mean fusion across pairs. On GLUE benchmarks, $p_o$ improves by about $2.4\%$ with modest $p_s$ loss, and the combined distribution yields up to about $5\%$ gains on unseen tasks, outperforming flat LR and EWC in many cases. The method preserves the transformer architecture, is applicable to other encoders/decoders, and provides a practical AutoML pathway for mitigating catastrophic forgetting in fine-tuning scenarios.

Abstract

Pretraining language models on large text corpora is a common practice in natural language processing. Fine-tuning of these models is then performed to achieve the best results on a variety of tasks. In this paper, we investigate the problem of catastrophic forgetting in transformer neural networks and question the common practice of fine-tuning with a flat learning rate for the entire network in this context. We perform a hyperparameter optimization process to find learning rate distributions that are better than a flat learning rate. We combine the learning rate distributions thus found and show that they generalize to better performance with respect to the problem of catastrophic forgetting. We validate these learning rate distributions with a variety of NLP benchmarks from the GLUE dataset.
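The combination step can be illustrated with a minimal sketch. It assumes each search run yields one learning rate per position in the network and that the runs are fused by an element-wise geometric mean, as the TL;DR states. The function name `combine_lr_distributions` and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def combine_lr_distributions(distributions):
    """Fuse several per-position learning-rate lists with an element-wise geometric mean.

    `distributions` is a list of equally long lists of positive floats,
    one list per dataset pair (an assumed data layout, for illustration only).
    """
    if not distributions:
        raise ValueError("need at least one distribution")
    n_positions = len(distributions[0])
    if any(len(d) != n_positions for d in distributions):
        raise ValueError("all distributions must have the same length")

    combined = []
    for rates in zip(*distributions):
        # The geometric mean is the arithmetic mean in log space,
        # which suits learning rates that span several orders of magnitude.
        log_mean = sum(math.log(r) for r in rates) / len(rates)
        combined.append(math.exp(log_mean))
    return combined


# Hypothetical learning rates found for two dataset pairs (three positions each).
pair_a = [1e-6, 1e-5, 1e-4]
pair_b = [1e-4, 1e-5, 1e-6]
combined = combine_lr_distributions([pair_a, pair_b])
```

Averaging in log space keeps a single very large learning rate from dominating the fused value, which an arithmetic mean would allow. In a framework such as PyTorch, the resulting list could then be assigned to optimizer parameter groups, one group per position.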

Paper Structure (13 sections, 1 equation, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Visualizations of sentences from the SST-2 dataset, embedded with BERT and projected to 2-D with t-SNE.
  • Figure 2: Combined learning rate as determined by the hyperparameter optimization process over the dataset shift experiments. The x-axis denotes the position of the learning rate in the transformer architecture, as described in the paper's section on the search space; lower numbers indicate earlier layers in the transformer. The y-axis denotes the learning rate (log scale).