Fighting Randomness with Randomness: Mitigating Optimisation Instability of Fine-Tuning using Delayed Ensemble and Noisy Interpolation

Branislav Pecher, Jan Cegin, Robert Belanec, Jakub Simko, Ivan Srba, Maria Bielikova

TL;DR

This paper tackles instability in fine-tuning pre-trained language models under limited labelled data by introducing Delayed Ensemble with Noisy Interpolation (DENI), which combines delayed ensemble construction with repeated noisy interpolation to reduce variance while keeping resource use modest. DENI draws on ensembling, noise regularisation, and model interpolation, so that a single model can approximate an ensemble efficiently. Across seven text-classification datasets, three base models, full fine-tuning and three PEFT methods, DENI consistently improves mean performance and reduces result variance compared to strong baselines, often at a fraction of the computational cost; data augmentation can further boost performance in PEFT settings. The findings suggest that optimisation-focused mitigation strategies, especially when combined with data augmentation in low-resource regimes, offer significant practical gains for robust fine-tuning.

Abstract

While fine-tuning of pre-trained language models generally helps to overcome the lack of labelled training samples, it also displays model performance instability. This instability mainly originates from randomness in initialisation or data shuffling. To address this, researchers either modify the training process or augment the available samples, which typically results in increased computational costs. We propose a new mitigation strategy, called Delayed Ensemble with Noisy Interpolation (DENI), that leverages the strengths of ensembling, noise regularisation and model interpolation, while retaining computational efficiency. We compare DENI with 9 representative mitigation strategies across 3 models, 4 tuning strategies and 7 text classification datasets. We show that: 1) DENI outperforms the best performing mitigation strategy (Ensemble), while using only a fraction of its cost; 2) the mitigation strategies are beneficial for parameter-efficient fine-tuning (PEFT) methods, outperforming full fine-tuning in specific cases; and 3) combining DENI with data augmentation often leads to even more effective instability mitigation.
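
The abstract names the ingredients of DENI (delayed ensembling, noise, and model interpolation) but does not spell out the algorithm. The toy sketch below illustrates only the general idea on a quadratic loss and is not the authors' procedure: the function names, the noise model, the plain parameter averaging, and every hyperparameter value are assumptions made for illustration.

```python
import random

def sgd_steps(weights, target, steps, lr, rng):
    """Run noisy gradient steps on the toy loss sum((w - t)^2)."""
    w = list(weights)
    for _ in range(steps):
        # Exact gradient plus Gaussian noise standing in for data-shuffling randomness.
        w = [wi - lr * (2.0 * (wi - ti) + rng.gauss(0.0, 1.0))
             for wi, ti in zip(w, target)]
    return w

def add_noise(weights, scale, rng):
    """Perturb every parameter with zero-mean Gaussian noise."""
    return [wi + rng.gauss(0.0, scale) for wi in weights]

def average(models):
    """Interpolate models by averaging their parameters element-wise."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

def loss(weights, target):
    return sum((wi - ti) ** 2 for wi, ti in zip(weights, target))

def noisy_interpolation_sketch(init, target, *, warmup_steps=20, rounds=3,
                               copies=4, steps_per_round=10,
                               noise_scale=0.1, lr=0.05, seed=0):
    """Hypothetical illustration: one model first, a noisy ensemble later."""
    rng = random.Random(seed)
    # Delay: train a single model before any ensemble exists.
    w = sgd_steps(init, target, warmup_steps, lr, rng)
    for _ in range(rounds):
        # Spawn noisy copies, train each briefly, then merge them back into one model.
        ensemble = [sgd_steps(add_noise(w, noise_scale, rng), target,
                              steps_per_round, lr, rng)
                    for _ in range(copies)]
        w = average(ensemble)
    return w

if __name__ == "__main__":
    start, goal = [5.0, -5.0], [0.0, 0.0]
    final = noisy_interpolation_sketch(start, goal)
    print("initial loss:", loss(start, goal))
    print("final loss:", loss(final, goal))
```

In this sketch, only the short per-round training is repeated for each copy, which is one way a method of this kind could cost less than training a full ensemble from scratch.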

Paper Structure

This paper contains 26 sections, 1 equation, 39 figures, 6 tables, and 1 algorithm.

Figures (39)

  • Figure 1: Repeating BERT fine-tuning multiple times without any mitigation leads to significant performance variance. Using DENI for randomness mitigation, the variance is reduced and performance increased.
  • Figure 2: An illustrative example of the Delayed Ensemble with Noisy Interpolation (DENI) method, which mitigates the performance instability of model fine-tuning with limited data. DENI alters regular fine-tuning, using noise-adding, model aggregation, and ensembling to steer the model(s) towards an optimal parameter setup in the parameter space. In comparison to simple ensembling, the method requires only a fraction of the computational resources.
  • Figure 3: Benefit of mitigation strategies for the different fine-tuning methods using BERT on the SNIPS dataset. The benefit is calculated as the difference from the mean performance of the Default baseline. The mitigation strategies are beneficial for all fine-tuning methods, but the overall benefit differs between them (e.g., Augment on IA3). In addition, the DENI method outperforms all other mitigation strategies, leading to higher performance and lower deviation.
  • Figure 4: Mitigation effectiveness across different dataset sizes for the BERT model on the TREC dataset. The benefit of mitigation strategies is highest at low numbers of shots and gradually decreases as the number of shots grows.
  • Figure 5: The effect of hyperparameter setup on the mitigation effectiveness of the DENI method, based on a hyperparameter search for the BERT model on the TREC dataset. The hyperparameters that affect each other are grouped together.
  • ...and 34 more figures