Learning from the Undesirable: Robust Adaptation of Language Models without Forgetting

Yunhun Nam; Jaehyung Kim; Jongheon Jeong

TL;DR

Learning-from-the-Undesirable (LfU) tackles overfitting and forgetting when fine-tuning language models with limited data by enforcing representation-level consistency between the original model and an auxiliary model exposed to an undesirable update. This is achieved by augmenting the model with an auxiliary component, either a low-rank adapter (LoRA) or a representation steering (RepS) module, performing a one-step gradient ascent on that component to push the model toward undesirable behavior, and penalizing divergence in internal representations across all layers via a mean-squared error loss. The resulting objective, $\ell_{\text{LfU}}(\boldsymbol{\theta}, \boldsymbol{\theta}_{\text{aux}}) = \ell_{\text{SFT}}(\boldsymbol{\theta}) + \lambda \cdot \ell_{\text{cons.}}(\boldsymbol{\theta}, \boldsymbol{\theta}_{\text{aux}})$, regularizes fine-tuning to preserve general capabilities while enabling task specialization. Empirical results across single-task and multi-task settings show that LfU improves downstream performance (e.g., a $16.8\%$ average improvement on math tasks over vanilla SFT) and enhances robustness to prompt variations and adversarial fine-tuning, with the RepS variant offering a lighter, faster alternative that maintains competitive performance. Overall, LfU provides a practical, scalable approach to robust LM adaptation that maintains pretrained knowledge and improves generalization across diverse downstream tasks.
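To make the three steps concrete, the following is a minimal sketch on a toy two-layer scalar model, not the authors' implementation. The model shape, the squared-error stand-in for $\ell_{\text{SFT}}$, the placement of the steering offset on the first hidden state, its zero initialization, and the names `forward`, `undesirable_steer`, `step_size`, and `lam` are all assumptions made for illustration.

```python
def forward(w1, w2, x, steer=0.0):
    """Toy two-layer scalar model. `steer` is a hypothetical steering
    offset added to the first hidden state (0.0 means the original model)."""
    h1 = w1 * x + steer
    h2 = w2 * h1
    return [h1, h2]  # hidden states of all layers; the last one is the output


def sft_loss(hidden, y):
    """Squared error on the output, standing in for the SFT loss."""
    return (hidden[-1] - y) ** 2


def undesirable_steer(w1, w2, x, y, step_size):
    """One gradient ASCENT step on the steering offset, starting from zero.
    The gradient of the loss with respect to the offset is 2 * (h2 - y) * w2."""
    hidden = forward(w1, w2, x)
    grad = 2.0 * (hidden[-1] - y) * w2
    return step_size * grad  # move along +grad, which increases the loss


def consistency_loss(hidden, hidden_aux):
    """Mean-squared error between hidden states, averaged over all layers."""
    squared = [(a - b) ** 2 for a, b in zip(hidden, hidden_aux)]
    return sum(squared) / len(squared)


def lfu_loss(w1, w2, x, y, lam=0.5, step_size=0.1):
    """Return (total, l_sft, l_cons, l_aux) for one training example."""
    hidden = forward(w1, w2, x)
    steer = undesirable_steer(w1, w2, x, y, step_size)
    hidden_aux = forward(w1, w2, x, steer)  # auxiliary model after the update
    l_sft = sft_loss(hidden, y)
    l_cons = consistency_loss(hidden, hidden_aux)
    l_aux = sft_loss(hidden_aux, y)  # only used to check the update is undesirable
    return l_sft + lam * l_cons, l_sft, l_cons, l_aux


total, l_sft, l_cons, l_aux = lfu_loss(w1=0.5, w2=2.0, x=1.0, y=2.0)
print(total, l_sft, l_cons, l_aux)
```

This sketch only evaluates the objective. In actual fine-tuning the total would be minimized with respect to $\boldsymbol{\theta}$ (here `w1` and `w2`) using automatic differentiation, and whether gradients also flow through the auxiliary branch is a design detail the summary above does not specify.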

Abstract

Language models (LMs) are often adapted through supervised fine-tuning (SFT) to specialize their capabilities for downstream tasks. However, in typical scenarios where the fine-tuning data is limited, e.g., compared to pre-training, SFT can lead LMs to overfit, causing them to rely on spurious patterns within the target task or to compromise other broadly useful capabilities as a side effect of narrow specialization. In this paper, we propose Learning-from-the-Undesirable (LfU), a simple yet effective regularization scheme for SFT to mitigate overfitting issues when fine-tuning LMs with limited data. Specifically, we aim to regularize the fine-tuning process to favor solutions that are resilient to "undesirable" model updates, e.g., gradient ascent steps that steer the model toward undesirable behaviors. To this end, we propose a novel form of consistency regularization that directly aligns internal representations of the model with those after an undesirable update. By leveraging representation-level data augmentation through undesirable updates, LfU effectively promotes generalization under limited data. Our experiments on diverse LM downstream tasks show that LfU serves as an effective prior that enhances adaptability while preserving pretrained knowledge. For example, an LM fine-tuned with LfU achieves a 16.8% average improvement on math tasks compared to vanilla SFT on the same dataset, where the latter even leads to degraded performance on those tasks. Furthermore, LfU exhibits improved robustness to prompt variations, e.g., yielding a 92.1% lower standard deviation in output performance compared to SFT, highlighting its versatile effects.
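The prompt-robustness figure above is a relative reduction in standard deviation. The sketch below shows one way such a number could be computed from per-prompt scores; the helper name `relative_std_reduction` and the score lists are invented for illustration and are not results from the paper. It assumes the population standard deviation, although with the same number of prompts in both lists the sample standard deviation gives the same ratio.

```python
from statistics import pstdev


def relative_std_reduction(baseline_scores, method_scores):
    """Fraction by which the spread of scores across prompt variants shrinks
    relative to the baseline (0.9 means a 90% lower standard deviation)."""
    baseline_std = pstdev(baseline_scores)
    if baseline_std == 0:
        raise ValueError("baseline scores have zero spread; ratio is undefined")
    return 1.0 - pstdev(method_scores) / baseline_std


# Hypothetical accuracies (%) for three prompt variants of the same task.
sft_scores = [40.0, 50.0, 60.0]
lfu_scores = [49.0, 50.0, 51.0]
print(f"{relative_std_reduction(sft_scores, lfu_scores):.1%} lower std")
```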
