Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting

Haolin Chen, Philip N. Garner

TL;DR

This paper addresses catastrophic forgetting in parameter-efficient fine-tuning (PEFT) by applying Bayesian transfer learning: a Laplace approximation of the posterior over the pre-trained weights serves as the prior that regularizes fine-tuning. The resulting Hessian-informed penalty preserves pre-trained knowledge and applies whenever the parameter shift of the fine-tuned layers can be computed differentiably. The paper compares diagonal (EWC/L2-SP) and Kronecker-factored (KFAC) Hessian approximations for LoRA-based updates. Across language modeling (GLUE and WikiText with OPT models) and speech synthesis (StyleTTS 2) tasks, Kronecker-factored regularization consistently improves preservation of pre-training knowledge without harming fine-tuning performance, while the diagonal methods are less effective, especially with larger fine-tuning datasets. The findings suggest a practical workflow for robust, parameter-efficient adaptation of large models, with KFAC offering better knowledge retention at a higher computational cost.

Abstract

We are motivated primarily by the adaptation of text-to-speech synthesis models; however, we argue that more generic parameter-efficient fine-tuning (PEFT) is an appropriate framework for such adaptation. Nevertheless, catastrophic forgetting remains an issue with PEFT, damaging the pre-trained model's inherent capabilities. We demonstrate that existing Bayesian learning techniques can be applied to PEFT to prevent catastrophic forgetting as long as the parameter shift of the fine-tuned layers can be calculated differentiably. In a principled series of experiments on language modeling and speech synthesis tasks, we utilize established Laplace approximations, including diagonal and Kronecker-factored approaches, to regularize PEFT with low-rank adaptation (LoRA) and compare their performance in pre-training knowledge preservation. Our results demonstrate that catastrophic forgetting can be overcome by our methods without degrading the fine-tuning performance, and that the Kronecker-factored approximation preserves the pre-training knowledge better than the diagonal ones.
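
To make the idea of a Hessian-informed penalty on a LoRA update concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the standard forms of these regularizers: the weight shift of a layer is the LoRA product, the diagonal penalty weights each squared shift entry by a per-parameter curvature estimate (all ones corresponds to an L2-SP-style penalty), and the Kronecker-factored penalty uses the identity vec(dW)^T (A ⊗ G) vec(dW) = tr(dW^T G dW A) for symmetric factors. All names (lora_shift, diagonal_penalty, kronecker_penalty, act_cov, grad_cov, lam) are hypothetical, and the paper's exact scaling may differ.

```python
# Illustrative sketch only: names and scaling are assumptions, not the paper's code.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]


def lora_shift(lora_B, lora_A, scale=1.0):
    """Weight shift of one layer under LoRA: dW = scale * B @ A (out x in)."""
    return [[scale * v for v in row] for row in matmul(lora_B, lora_A)]


def diagonal_penalty(delta_w, curvature_diag, lam=1.0):
    """(lam / 2) * sum_ij F_ij * dW_ij^2, with F the diagonal curvature estimate.

    Passing all ones for curvature_diag gives a plain squared-distance penalty.
    """
    total = 0.0
    for f_row, d_row in zip(curvature_diag, delta_w):
        for f, d in zip(f_row, d_row):
            total += f * d * d
    return 0.5 * lam * total


def kronecker_penalty(delta_w, act_cov, grad_cov, lam=1.0):
    """(lam / 2) * tr(dW^T G dW A) for a curvature approximated as A (x) G.

    act_cov  (A): in x in   covariance of layer inputs (assumed symmetric)
    grad_cov (G): out x out covariance of output gradients (assumed symmetric)
    """
    left = matmul(transpose(delta_w), grad_cov)   # in x out
    right = matmul(delta_w, act_cov)              # out x in
    prod = matmul(left, right)                    # in x in
    return 0.5 * lam * sum(prod[i][i] for i in range(len(prod)))


# Tiny rank-1 example: a 2 x 3 layer with LoRA factors B (2 x 1) and A (1 x 3).
lora_B = [[1.0], [2.0]]
lora_A = [[1.0, 0.0, -1.0]]
delta_w = lora_shift(lora_B, lora_A)

ones = [[1.0] * 3 for _ in range(2)]
eye2 = [[1.0, 0.0], [0.0, 1.0]]
eye3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

pen_diag = diagonal_penalty(delta_w, ones)
pen_kron = kronecker_penalty(delta_w, eye3, eye2)
```

With identity factors the two penalties coincide; they differ once the curvature factors carry off-diagonal or non-uniform structure, which is the information a diagonal approximation discards. In actual fine-tuning the shift would be computed inside an automatic-differentiation framework so that the penalty's gradient flows back to the LoRA factors.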
