Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts

Sai Ashish Somayajula, Youwei Liang, Abhishek Singh, Li Zhang, Pengtao Xie

TL;DR

This work tackles the instability and overfitting that arise when finetuning pretrained language models on low-resource data. It introduces an attention-guided weight mixup that replaces discrete subnetwork selection with a continuous interpolation between task-specific and pretrained parameters, controlled by a learnable per-parameter attention value, and optimizes these components via a bilevel optimization framework on two splits of the training data. In GLUE-based experiments across multiple PLMs, the method achieves higher accuracy and greater stability in low-resource settings than vanilla finetuning and standard regularization baselines. The approach offers a principled path to more reliable low-resource NLP adaptation, with potential extensions to lifelong and multilingual learning scenarios.

Abstract

Pretrained Language Models (PLMs) have advanced Natural Language Processing (NLP) tasks significantly, but finetuning PLMs on low-resource datasets poses significant challenges such as instability and overfitting. Previous methods tackle these issues by finetuning a strategically chosen subnetwork on a downstream task, while keeping the remaining weights fixed to the pretrained weights. However, they rely on suboptimal criteria for sub-network selection, leading to suboptimal solutions. To address these limitations, we propose a regularization method based on attention-guided weight mixup for finetuning PLMs. Our approach represents each network weight as a mixup of a task-specific weight and a pretrained weight, controlled by a learnable attention parameter, providing finer control over sub-network selection. Furthermore, we employ a bi-level optimization (BLO) based framework on two separate splits of the training dataset, improving generalization and combating overfitting. We validate the efficacy of our proposed method through extensive experiments, demonstrating its superiority over previous methods, particularly in the context of finetuning PLMs on low-resource datasets.
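
The weight mixup described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so the elementwise convex combination below, the sigmoid parameterization of the attention value, and the names `mix_weights` and `alpha_logits` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def mix_weights(w_task, w_pre, alpha_logits):
    """Elementwise interpolation between task and pretrained weights.

    Assumption: attention is stored as unconstrained logits and squashed
    into (0, 1) with a sigmoid; the source only says it is learnable.
    """
    alpha = 1.0 / (1.0 + np.exp(-alpha_logits))
    # alpha near 1 keeps the task weight, alpha near 0 keeps the pretrained weight.
    return alpha * w_task + (1.0 - alpha) * w_pre

# Toy 2x2 weight matrices (hypothetical values).
w_pre = np.array([[1.0, 2.0], [3.0, 4.0]])
w_task = np.zeros((2, 2))

# Zero logits give alpha = 0.5 everywhere, i.e. an even blend.
mixed = mix_weights(w_task, w_pre, np.zeros((2, 2)))
```

Under this reading, a hard subnetwork mask is the special case in which every attention value sits at 0 or 1, which is one way to see why a continuous value gives finer control over which weights move away from their pretrained values.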

Paper Structure (38 sections, 12 equations, 3 figures, 11 tables, 1 algorithm)

Figures (3)

  • Figure 1: An overview of our proposed method: learning the task weights $W$ and the attention parameter $\alpha$ in a bilevel optimization framework. The final network weight $\tilde{W}$ is a combination of the pretrained weight $W_0$ and the task weight $W$ via the learned attention parameter $\alpha$.
  • Figure 2: Averaged performance across CoLA, RTE, STSB, and MRPC datasets for Vanilla, Prompt Tuning, Prefix-Tuning, LoRA, and our method in low-resource scenarios with 500 and 1000 training instances. Results on each dataset are presented in Table \ref{tab:lora}.
  • Figure 3: Comparison of our method with Vanilla, $\text{CHILD-TUNING}_D$, and DPS dense method on the QNLI and SST-2 datasets with sufficient training examples. The bar plots represent the mean accuracy from ten random seeds, and error bars denote the standard deviation.
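
The bilevel setup in the Figure 1 caption, with task weights $W$ and attention $\alpha$ learned on two data splits, might be approximated as in the sketch below. This is a toy linear-regression stand-in, not the paper's algorithm: the assignment of $W$ to one split and $\alpha$ to the other, the first-order alternating updates (which ignore how the optimal $W$ depends on $\alpha$), the sigmoid parameterization, and every name and hyperparameter here are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w_mixed, X, y):
    """Mean squared error of a linear model and its gradient w.r.t. the mixed weight."""
    resid = X @ w_mixed - y
    return float(np.mean(resid ** 2)), 2.0 * X.T @ resid / len(y)

def alternating_finetune(w_pre, X_a, y_a, X_b, y_b,
                         steps=200, lr_w=0.05, lr_alpha=0.05):
    """First-order alternating updates (assumed simplification of bilevel training)."""
    w = w_pre.copy()                 # task weights start at the pretrained values
    logits = np.zeros_like(w_pre)    # attention logits, alpha = 0.5 initially
    for _ in range(steps):
        # Lower-level step: update task weights on split A with alpha held fixed.
        alpha = sigmoid(logits)
        _, g = loss_and_grad(alpha * w + (1 - alpha) * w_pre, X_a, y_a)
        w = w - lr_w * g * alpha                       # chain rule: d mixed / d w = alpha
        # Upper-level step: update attention on split B with task weights held fixed.
        _, g = loss_and_grad(alpha * w + (1 - alpha) * w_pre, X_b, y_b)
        logits = logits - lr_alpha * g * (w - w_pre) * alpha * (1 - alpha)
    return w, sigmoid(logits)

# Synthetic data: the "pretrained" weight is a perturbed copy of the true weight.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
w_pre = w_true + np.array([0.5, 0.5, -0.5])
X_a, X_b = rng.normal(size=(200, 3)), rng.normal(size=(200, 3))
y_a, y_b = X_a @ w_true, X_b @ w_true

w, alpha = alternating_finetune(w_pre, X_a, y_a, X_b, y_b)
initial_loss, _ = loss_and_grad(w_pre, X_b, y_b)
final_loss, _ = loss_and_grad(alpha * w + (1 - alpha) * w_pre, X_b, y_b)
```

Updating the attention on data the task weights have not seen is the part meant to mirror the generalization argument in the abstract; a faithful bilevel solver would also differentiate through the lower-level update rather than treating the task weights as constants.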