FTFT: Efficient and Robust Fine-Tuning by Transferring Training Dynamics

Yupei Du, Albert Gatt, Dong Nguyen

TL;DR

The paper tackles the robustness gap in fine-tuning large pre-trained language models by reducing the computational cost of dataset cartography. It shows that the training dynamics used to identify important training instances are largely transferable across model sizes and pretraining methods, so lightweight reference models can be used efficiently. The proposed method, Fine-Tuning by transFerring Training dynamics (FTFT), improves out-of-distribution robustness while cutting training cost by up to about 50% through data-driven instance selection and aggressive early stopping. The approach is of practical value for building robust NLP systems under distribution shift and offers a scalable path to efficient robust fine-tuning. Remaining limitations include the protocol for selecting a reference model, the theoretical grounding of why training dynamics transfer, and extension beyond classification tasks.

Abstract

Despite the massive success of fine-tuning Pre-trained Language Models (PLMs), they remain susceptible to out-of-distribution input. Dataset cartography is a simple yet effective dual-model approach that improves the robustness of fine-tuned PLMs. It involves fine-tuning a model on the original training set (i.e. reference model), selecting a subset of important training instances based on the training dynamics, and fine-tuning again only on these selected examples (i.e. main model). However, this approach requires fine-tuning the same model twice, which is computationally expensive for large PLMs. In this paper, we show that (1) training dynamics are highly transferable across model sizes and pre-training methods, and that (2) fine-tuning main models using these selected training instances achieves higher training efficiency than empirical risk minimization (ERM). Building on these observations, we propose a novel fine-tuning approach: Fine-Tuning by transFerring Training dynamics (FTFT). Compared with dataset cartography, FTFT uses more efficient reference models and aggressive early stopping. FTFT achieves robustness improvements over ERM while lowering the training cost by up to $\sim 50\%$.
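
The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual dataset-cartography definition of an ambiguous instance (high variability, i.e. a high standard deviation across epochs of the probability the reference model assigns to the gold label), and it assumes a 33% selection ratio. The function name `select_ambiguous` and the toy numbers are hypothetical.

```python
from statistics import pstdev


def select_ambiguous(p_true_by_epoch, fraction=0.33):
    """Return sorted indices of the most ambiguous training instances.

    p_true_by_epoch[i] holds, for instance i, the probability the reference
    model assigned to the gold label at the end of each epoch.
    `fraction` is an assumed selection ratio, not a value confirmed here.
    """
    # Variability: standard deviation of p_true across epochs.
    variability = [pstdev(ps) for ps in p_true_by_epoch]
    k = max(1, round(fraction * len(p_true_by_epoch)))
    # Rank instances from most to least variable and keep the top k.
    ranked = sorted(range(len(variability)),
                    key=lambda i: variability[i], reverse=True)
    return sorted(ranked[:k])


# Toy training dynamics from a hypothetical reference model (3 epochs).
dynamics = [
    [0.90, 0.95, 0.99],  # easy: consistently high p_true
    [0.20, 0.80, 0.30],  # ambiguous: p_true fluctuates
    [0.05, 0.10, 0.05],  # hard: consistently low p_true
    [0.40, 0.90, 0.50],  # ambiguous
    [0.80, 0.85, 0.90],  # easy
    [0.10, 0.10, 0.15],  # hard
]
selected = select_ambiguous(dynamics)
```

The main model would then be fine-tuned only on the instances in `selected`. In FTFT, the dynamics come from a smaller reference model than the main model.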

Paper Structure (23 sections, 5 figures, 4 tables)

This paper contains 23 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Transferability panel (across sizes, instance level): Consistency across different sizes of DeBERTaV3 on NLI. The numbers are the fractions (0–1) of ambiguous training instances shared by two models. Training dynamics are transferable across different model sizes: the fractions shared between models of different sizes are only slightly smaller than those shared between models trained with different random seeds (shown as superscripts). Speed-gain panels (CAD and DynaHate): Performance on HSD when training the main model ($\text{DeBERTaV3}_{\text{Large}}$) using different numbers of training steps. We experimented with different lengths of training (max training steps) and different methods (ERM and DM). Training with data instances selected by DMs achieves consistently higher training speed than ERM: on datasets where training with DM achieves either better (OOD datasets, right) or worse (ID datasets, left) performance, models trained with DM outperform ERM with fewer training steps (i.e. in the early stage of training, the leftmost part of the x-axis).
  • Figure 2: Change of median $p_{\text{true}}$: ineffective reference models ($\text{ELECTRA}_{\text{Small}}$ and TinyBERT) are unable to fit difficult training instances, causing easy instances to be identified as ambiguous.
  • Figure 3: Performance when training the main model ($\text{DeBERTaV3}_{\text{Large}}$) using different numbers of training steps across different checkpoints. We experimented with different lengths of training (max training steps) and different methods (ERM and DM). Training with data instances selected by DMs achieves consistently higher training speed than ERM: on datasets where training with DM achieves either better or worse performance, models trained with DM outperform ERM with fewer training steps (i.e. in the early stage of training, the leftmost part of the x-axis).
  • Figure 4: Our results with $\text{DeBERTaV3}$ as the main model on HSD (measured by F1 scores), which consist of four parts: (1) Baselines: $\text{DeBERTaV3}$ of different sizes trained using ERM, $\text{DeBERTaV3}_{\text{Large}}$ trained using ERM with early stopping (ERM(ES)), and $\text{DeBERTaV3}_{\text{Large}}$ trained using a random DM (a random 33% of the training data); (2) Training dynamics transferability across different sizes: training $\text{DeBERTaV3}_{\text{Large}}$ as the main model, using DMs constructed by different sizes of $\text{DeBERTaV3}$ as reference models; (3) Training dynamics transferability across different pretraining methods: training $\text{DeBERTaV3}_{\text{Large}}$ as the main model, using DMs constructed by reference models with different pretraining methods, including $\text{ELECTRA}_{\text{Small}}$, $\text{ELECTRA}_{\text{Base}}$, $\text{TinyBERT}$, and $\text{RoBERTa}$; (4) FTFT: training $\text{DeBERTaV3}_{\text{Large}}$ using our approach FTFT, with DMs constructed by different reference models, as well as aggressive early stopping. Ori/Pert and R2–R4 in DynaHate refer to different rounds of collected Original and Perturbed data. Compute refers to the training computational cost relative to training $\text{DeBERTaV3}_{\text{Large}}$ using ERM. We observe that: (1) Training dynamics are transferable across different sizes and pretraining methods, as constructing DMs using different reference models results in comparable performance; (2) FTFT achieves consistent robustness improvements over ERM, while maintaining or lowering the training cost; (3) FTFT improves efficiency more when the optimal length of training is longer (ERM only trains for 1.6k steps on CAD).
  • Figure 5: We use reference models of different capabilities (measured by their ERM F1 scores) to construct DMs, and use them to train main models. The gray-shaded rows are (1) the main models in unsuccessful transfers and (2) their corresponding reference models. The orange-shaded rows are those with very high standard deviations: we observe that fine-tuning $\text{ELECTRA}_{\text{Large}}$ on small datasets like CAD is very unstable and often produces failed runs. Following Mosbach et al. (2021), we removed the runs with ID performance worse than the majority classifier. However, some remaining runs have slightly better ID performance than the majority classifier but a diverged or fluctuating loss after the first few training steps (i.e. the loss curve goes up or stays almost flat), and they are significantly worse than the other runs. These "almost failed runs" cause the high standard deviations in the orange-shaded rows. We therefore exclude these runs from our analyses in the subsection on how efficient we can be. Successful transfer requires the reference model to be reasonably strong: reference models with clearly worse ID performance lead to degraded OOD performance for the main models.
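
The consistency numbers in Figure 1 (the share of ambiguous training instances that two models have in common) can be illustrated with a short sketch. It assumes both models select the same number of instances, so that normalizing by either selection gives the same value. The exact normalization used in the paper is not stated here, and the function name `shared_fraction` and the example indices are hypothetical.

```python
def shared_fraction(selected_a, selected_b):
    """Fraction (0-1) of instances selected by model A that model B also selected.

    Assumes both selections have the same size, in which case the
    measure is symmetric in A and B.
    """
    a, b = set(selected_a), set(selected_b)
    if not a:
        raise ValueError("selected_a must not be empty")
    return len(a & b) / len(a)


# Hypothetical indices of ambiguous instances chosen by two reference models.
small_model_selection = [1, 3, 5, 7]
large_model_selection = [3, 4, 5, 6]
overlap = shared_fraction(small_model_selection, large_model_selection)
```

A value close to the overlap observed between two random seeds of the same model would indicate, as the caption argues, that the selection transfers across model sizes.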