LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views

Yuji Roh, Qingyun Liu, Huan Gui, Zhe Yuan, Yujin Tang, Steven Euijong Whang, Liang Liu, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao

TL;DR

LEVI targets the problem that fine-tuning pre-trained models can underperform on unseen distributions due to inherent flaws in both pre-trained representations and fine-tuning data. It introduces a layer-wise ensemble that jointly leverages a fixed pre-trained model and a small task-specific model, connecting their outputs through adapting layers and optimizing a shared loss across multiple intermediate representations. The approach demonstrates strong improvements in OOD generalization across language and vision benchmarks, while maintaining competitive ID performance and offering efficiency benefits, including compatibility with LoRA. This work underscores the value of incorporating trained-from-scratch views to mitigate pre-training limitations and enhance robustness in real-world deployment scenarios.

Abstract

Fine-tuning is becoming widely used for leveraging the power of pre-trained foundation models in new downstream tasks. While there are many successes of fine-tuning on various tasks, recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions (i.e., out-of-distribution; OOD). To improve OOD generalization, some previous studies identify the limitations of fine-tuning data and regulate fine-tuning to preserve the general representation learned from pre-training data. However, potential limitations in the pre-training data and models are often ignored. In this paper, we contend that overly relying on the pre-trained representation may hinder fine-tuning from learning essential representations for downstream tasks and thus hurt its OOD generalization. It can be especially catastrophic when new tasks are from different (sub)domains compared to pre-training data. To address the issues in both pre-training and fine-tuning data, we propose a novel generalizable fine-tuning method LEVI (Layer-wise Ensemble of different VIews), where the pre-trained model is adaptively ensembled layer-wise with a small task-specific model, while preserving its efficiencies. By combining two complementing models, LEVI effectively suppresses problematic features in both the fine-tuning data and pre-trained model and preserves useful features for new tasks. Broad experiments with large language and vision models show that LEVI greatly improves fine-tuning generalization via emphasizing different views from fine-tuning data and pre-trained features.
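To make the layer-wise ensemble idea concrete, the following is a minimal NumPy sketch of a forward pass and a shared loss, not the authors' implementation. Several details are assumptions made for illustration: the toy dimensions, the use of concatenation to join each frozen intermediate representation with the small model's output, a single linear map as each adapting layer, a mean squared error objective, and averaging the per-layer predictions at inference. All names (`pretrained_features`, `layerwise_predictions`, `shared_loss`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_pre, d_small, n_layers = 4, 8, 16, 4, 3

# Frozen "pre-trained" weights (random stand-ins here; never updated).
pre_weights = [
    rng.normal(size=(d_in if i == 0 else d_pre, d_pre)) * 0.1
    for i in range(n_layers)
]
# Small task-specific model and one adapting layer per intermediate layer.
# In training, only these would receive gradient updates.
w_small = rng.normal(size=(d_in, d_small)) * 0.1
adapters = [rng.normal(size=(d_pre + d_small, 1)) * 0.1 for _ in range(n_layers)]


def pretrained_features(x):
    """Return the intermediate representation after each frozen layer."""
    feats, h = [], x
    for w in pre_weights:
        h = np.tanh(h @ w)
        feats.append(h)
    return feats


def layerwise_predictions(x):
    """One prediction per intermediate layer, each combining both views."""
    task_view = np.tanh(x @ w_small)  # view learned from fine-tuning data only
    preds = []
    for feat, adapter in zip(pretrained_features(x), adapters):
        joint = np.concatenate([feat, task_view], axis=1)  # assumed combination
        preds.append((joint @ adapter)[:, 0])
    return preds


def shared_loss(preds, y):
    """Average the squared-error loss over all per-layer predictions."""
    return float(np.mean([np.mean((p - y) ** 2) for p in preds]))


x = rng.normal(size=(batch, d_in))
y = rng.normal(size=batch)
preds = layerwise_predictions(x)
loss = shared_loss(preds, y)
final_prediction = np.mean(preds, axis=0)  # assumed inference-time ensemble
```

Because the pre-trained weights stay fixed in this sketch, only the small model and the adapting layers would need gradients, which is one plausible reading of how the method keeps training cost low.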

Paper Structure

This paper contains 48 sections, 3 theorems, 3 equations, 8 figures, 18 tables.

Key Result

Lemma 1

ERM-based model training can be affected by spurious features in the training data. Let the training data $D$ have input features ${\textnormal{x}} = [x_1, x_2]$, where $x_1$ is a spurious feature and $x_2$ is a transferable feature. When we train a model with randomly initialized weights ${\bm{w
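The lemma statement is cut off in this copy, so the sketch below illustrates only its opening claim, under assumptions of our own rather than the paper's setup. It builds synthetic data in which $x_1$ tracks the label closely during training but is independent of it under distribution shift, while $x_2$ is a noisier feature whose relation to the label is stable. The minimizer of the empirical squared loss is computed in closed form with `np.linalg.lstsq` instead of by gradient descent from random initialization; the data-generating process, noise scales, and names (`make_data`, `id_mse`, `ood_mse`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000


def make_data(spurious_correlated):
    """Return features [x1, x2] and labels y.

    x2 is transferable: y = x2 + noise in every distribution.
    x1 is spurious: it follows y only when spurious_correlated is True.
    """
    x2 = rng.normal(size=n)
    y = x2 + rng.normal(size=n)
    if spurious_correlated:
        x1 = y + 0.1 * rng.normal(size=n)  # near-copy of the label
    else:
        x1 = rng.normal(size=n)            # unrelated to the label
    return np.column_stack([x1, x2]), y


X_train, y_train = make_data(spurious_correlated=True)
X_ood, y_ood = make_data(spurious_correlated=False)

# Empirical risk minimizer for squared loss with a linear model.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

id_mse = float(np.mean((X_train @ w - y_train) ** 2))
ood_mse = float(np.mean((X_ood @ w - y_ood) ** 2))
print("weights [x1, x2]:", w)
print("ID MSE:", id_mse, "OOD MSE:", ood_mse)
```

In this construction the fitted model places most of its weight on $x_1$ because it is the less noisy predictor in training, so the error grows sharply once the spurious correlation disappears.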

Figures (8)

  • Figure 1: When both pre-trained features and fine-tuning data have inherent problems such as spurious features, they can jointly degrade the OOD generalization of the resulting fine-tuned model. Indeed, we observe that the OOD performance of the fine-tuned model is worse (red color in the table) than that of both the pre-trained and trained-from-scratch (i.e., randomly initialized, then trained on fine-tuning data) models, where we 1) fine-tune a pre-trained language model (T5x) on various downstream tasks (movie and product recommendations) and 2) test on 20 distribution shifts (e.g., subpopulation and time shifts). To address this issue, our key idea is to separately leverage different views from a pre-trained model and a trained-from-scratch model via a layer-wise ensemble, reducing the impact of problematic features while preserving necessary ones. Compared to the vanilla ensemble of these two complementary models (fourth column), LEVI further improves both ID and OOD performance while preserving training and inference efficiency (see framework details in Sec. \ref{sec:framework}).
  • Figure 2: Toy example of a duck classification scenario.
  • Figure 3: LEVI architecture of using a layer-wise ensemble.
  • Figure 4: Effects of intermediate layers on ID (blue) and OOD (red) performance, where lower RMSE is better. We report the average results on MovieLens and Amazon Review using T5x.
  • Figure 5: In-distribution data examples from Diabetic Retinopathy (Medical). The images are from TensorFlow Datasets (TFDS).
  • ...and 3 more figures

Theorems & Definitions (6)

  • Lemma 1
  • Lemma 2
  • Corollary 3
  • Remark 4
  • Remark 5: Using a Fine-tuned Large Model
  • Remark 6: Compatibility with Efficient Training Approaches