Information Guided Regularization for Fine-tuning Language Models

Mandar Sharma, Nikhil Muralidhar, Shengzhe Xu, Raquib Bin Yousuf, Naren Ramakrishnan

TL;DR

This work investigates, through an information-theoretic lens, how task-sensitive parameters affect the pretraining loss landscape, and devises a novel approach to dropout for improved model regularization and better downstream generalization.

Abstract

The pretraining-fine-tuning paradigm has been the de facto strategy for transfer learning in modern language modeling. With the understanding that task adaptation in LMs is often a function of parameters shared across tasks, we argue that a more surgical approach to regularization needs to exist for smoother transfer learning. Towards this end, we investigate how the pretraining loss landscape is affected by these task-sensitive parameters through an information-theoretic lens. We then leverage the findings from our investigations to devise a novel approach to dropout for improved model regularization and better downstream generalization. This approach, named guided dropout, is both task & architecture agnostic and adds no computational overhead to the fine-tuning process. Through empirical evaluations, we showcase that our approach to regularization yields consistently better performance, even in scenarios of data paucity, compared to standardized baselines.
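
The abstract does not spell out how guided dropout assigns dropout rates, so the following is only a minimal sketch of one way an information-guided schedule could look. It assumes that layers with higher Fisher scores should be disturbed less, and it uses a linear rank-based schedule. The function name `guided_dropout_rates` and the bounds `p_min` and `p_max` are hypothetical and not taken from the paper.

```python
def guided_dropout_rates(layer_scores, p_min=0.05, p_max=0.15):
    """Map per-layer information scores to per-layer dropout probabilities.

    Assumption (not from the source): the most informative layer receives
    p_min, the least informative receives p_max, and the layers in between
    are spaced linearly by rank.
    """
    n = len(layer_scores)
    if n == 0:
        return []
    if n == 1:
        return [p_min]
    # Layer indices ordered from most to least informative.
    order = sorted(range(n), key=lambda i: layer_scores[i], reverse=True)
    rates = [0.0] * n
    for rank, layer in enumerate(order):
        rates[layer] = p_min + (p_max - p_min) * rank / (n - 1)
    return rates


# Hypothetical aggregated scores for a three-layer model.
rates = guided_dropout_rates([0.1, 0.9, 0.5])
```

Under these assumptions the rates depend only on scores from the pretrained model, so they could be computed once before fine-tuning begins rather than at every training step.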

Paper Structure (15 sections, 6 equations, 8 figures, 3 tables)

This paper contains 15 sections, 6 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: The 3D loss landscape geometry of the pretrained $BERT_{BASE}$ model when: (a) all model parameters are perturbed, (b) 50% of the model parameters with the highest Fisher scores are perturbed, and (c) 50% of the model parameters with the lowest Fisher scores are perturbed. Plot (b) resembles the typical sharp minimizers that generalize poorly compared to the wide minimizers of base plot (a) and control plot (c).
  • Figure 2: The loss landscape geometry of the pretrained $BERT_{BASE}$ model when: (a) & (b) $n$ % of parameters with the highest Fisher scores are perturbed, (c) $n$ % of randomly chosen parameters are perturbed. Plot (a) showcases worsening minima as we narrow down on the parameters deemed essential by the Fisher score. Plot (b) showcases that the worsening stops and the landscape remains consistent as we take $\leq$ 40% of the highest-scoring parameters. Plot (c) acts as the control plot.
  • Figure 3: The sorted Fisher scores (normalized) for BERT, GPT2, and T5 based on their (a) model parameters and (b) model layers (aggregated) - see appendix § A.1 and § A.2 for the raw (non-normalized) plots. From (a) we observe that a fraction of LM parameters are often attributed the highest Fisher scores. Similarly, (b) provides an aggregate view showcasing how a minority of the LM layers hold the majority of training information. Please see appendix § A.3 for the unsorted layer-wise Fisher score distributions of these models.
  • Figure 4: The performance of Guided dropout (our regularizer) vs standardized baselines on fine-tuning $BERT_{BASE}$ on decreasing cuts of the training datasets for (a) MRPC (b) STS-B (c) RTE and (d) CoLA. Each data point in the grid is an average across 5 random restarts, as shown by the boxplots. Guided dropout yields consistently better results for all the tasks, especially under data paucity.
  • Figure 5: When LM parameters are sorted by increasing Fisher score, we observe that only an exceedingly tiny fraction of the model parameters have significantly high scores. This observation is consistent for all popular transformer architectures: encoder only (BERT with 0.001% of parameters having $\geq$ 0.01 score), decoder only (GPT2 with 0.02% of parameters having $\geq$ 0.01 score), and encoder-decoder models (T5 with 0.0009% of parameters having $\geq$ 0.01 score).
  • ...and 3 more figures
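
The figure captions above rely on per-parameter Fisher scores and on selecting the top $n$ % of parameters by score. As a rough illustration only, the sketch below computes the diagonal of the empirical Fisher information (the average squared gradient of the log-likelihood) for a toy logistic-regression model, then picks the highest-scoring fraction. The toy model, the data, and the helper names `empirical_fisher_diagonal` and `top_fraction_indices` are assumptions for illustration, not the paper's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def empirical_fisher_diagonal(weights, data):
    """Average squared gradient of the log-likelihood for each weight.

    For logistic regression, d/dw_i log p(y | x) = (y - sigmoid(w . x)) * x_i.
    """
    scores = [0.0] * len(weights)
    for x, y in data:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        for i, xi in enumerate(x):
            grad = (y - p) * xi
            scores[i] += grad * grad
    return [s / len(data) for s in scores]


def top_fraction_indices(scores, fraction):
    """Indices of the highest-scoring `fraction` of parameters (at least one)."""
    k = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Toy data: the third feature is always zero, so its weight carries no information.
weights = [0.0, 0.0, 0.0, 0.0]
data = [([1.0, 2.0, 0.0, 0.5], 1), ([1.0, -2.0, 0.0, 0.5], 0)]

scores = empirical_fisher_diagonal(weights, data)
selected = top_fraction_indices(scores, 0.5)  # analogous to "50% highest Fisher scores"
```

In a real language model the gradients would come from automatic differentiation over a sample of the pretraining data rather than from a closed-form expression.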