Exploring Representation Invariance in Finetuning

Wenqiang Zu, Shenghao Xie, Hao Chen, Zhiqiang Chen, Liwen Hu, Yuanhao Xi, Yiming Liang, Junliang Ye, Bo Lei, Tiejun Huang, Guoqi Li, Lei Ma

TL;DR

This paper tackles the problem that finetuning foundation models on cross-domain, low-resource tasks can erode pretrained representations and generalization. It introduces Representation Invariance FineTuning (RIFT), a regularization that enforces orthogonal invariance between pretrained and finetuned representations by matching covariances of final-layer embeddings through a learnable orthogonal transform, while avoiding expensive pairwise similarity computations. RIFT is shown to be compatible with common finetuning approaches, improving representation similarity (CKA) and often preserving or enhancing downstream performance across medical image datasets and different backbones, including large Vision Transformers. The results demonstrate that adaptation and generalization can be jointly maintained, enabling more robust cross-domain transfer and suggesting a new direction for finetuning paradigms that prioritize both effective learning and preservation of pretrained semantic structure.
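The covariance-matching idea summarized above can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the paper's implementation: the squared Frobenius penalty, the function names `covariance` and `covariance_matching_penalty`, and the use of a fixed (rather than learned) orthogonal matrix `q` are all hypothetical choices made here. It shows that when finetuned features are an exact rotation of pretrained features, the penalty vanishes for the matching rotation, and that only $d \times d$ covariances are formed, never an $n \times n$ pairwise similarity matrix.

```python
import numpy as np

def covariance(z):
    """Sample covariance (d x d) of an (n x d) feature matrix."""
    zc = z - z.mean(axis=0, keepdims=True)
    return zc.T @ zc / (z.shape[0] - 1)

def covariance_matching_penalty(z_pre, z_ft, q):
    """Hypothetical penalty: || Q C_pre Q^T - C_ft ||_F^2.

    The exact loss used by RIFT is not given in this summary;
    this form is an assumption made for illustration.
    """
    c_pre = covariance(z_pre)
    c_ft = covariance(z_ft)
    return float(np.linalg.norm(q @ c_pre @ q.T - c_ft, ord="fro") ** 2)

rng = np.random.default_rng(0)
n, d = 256, 8

# Anisotropic stand-in for "pretrained" final-layer embeddings.
z_pre = rng.normal(size=(n, d)) * np.arange(1, d + 1)

# A random orthogonal matrix via QR; a stand-in for the learnable transform.
q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Stand-in for "finetuned" embeddings: each feature vector z is mapped to Q z.
z_ft = z_pre @ q_true.T

penalty_aligned = covariance_matching_penalty(z_pre, z_ft, q_true)
penalty_identity = covariance_matching_penalty(z_pre, z_ft, np.eye(d))
print(penalty_aligned, penalty_identity)
```

In an actual training loop the orthogonal matrix would be a trainable parameter and the penalty would be added to the task loss with some weight; how the paper parameterizes and optimizes it is not specified in this summary.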

Abstract

Foundation models pretrained on large-scale natural images are widely adapted to various cross-domain low-resource downstream tasks, benefiting from generalizable and transferable patterns captured by their representations. However, these representations are later found to gradually vanish during finetuning, accompanied by a degradation of the model's original generalizability. In this paper, we argue that such tasks can be effectively adapted without sacrificing the benefits of pretrained representations. We approach this by introducing Representation Invariance FineTuning (RIFT), a regularization that maximizes the representation similarity between pretrained and finetuned models by leveraging orthogonal invariance of manifolds in a computationally efficient way. Experiments demonstrate that our method is compatible with mainstream finetuning methods, offering competitive or even enhanced performance and better preservation of generalizability.
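The notion of representation similarity that is invariant to orthogonal transformations can be made concrete with linear CKA, the similarity measure named in the TL;DR. The sketch below uses the standard linear CKA formula; the function name `linear_cka` and the random test data are assumptions introduced here for illustration. It checks numerically that rotating a representation by an orthogonal matrix leaves its linear CKA with the original at 1, while an unrelated random representation scores lower.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (n x d) feature matrices over the same n inputs."""
    # Center each feature dimension.
    xc = x - x.mean(axis=0, keepdims=True)
    yc = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(yc.T @ xc, ord="fro") ** 2
    norm_x = np.linalg.norm(xc.T @ xc, ord="fro")
    norm_y = np.linalg.norm(yc.T @ yc, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 16))

# Random orthogonal matrix obtained from a QR decomposition.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))

cka_rotated = linear_cka(x, x @ q)                          # rotated copy of x
cka_unrelated = linear_cka(x, rng.normal(size=(200, 16)))   # independent features
print(cka_rotated, cka_unrelated)
```

Because the score depends on the features only through their Gram structure, any orthogonal change of basis leaves it unchanged, which is the property a regularizer built on orthogonal invariance relies on.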

Paper Structure

This paper contains 20 sections, 2 theorems, 29 equations, 7 figures, 11 tables.

Key Result

Proposition A3.3

Let $W = W_1 W_2 \cdots W_L$ be an $L$-layer linear network with $W_i \in \mathbb{R}^{d \times d}$, and assume that $W$ exhibits strong generalization in the sense of Assumption asm:pretrain_multi, i.e., Then there exist orthogonal matrices $Q_1, \dots, Q_L \in O(d)$ such that the rotated network has a strictly larger spectral norm than the original $W$. Intuitively, because the pretrained netwo

Figures (7)

  • Figure 1: (a) Orthogonal transformation of the pretrained representation. (b) Orthogonal transformation of the covariance.
  • Figure 2: (a) Applying orthogonal transformation at the intermediate layer. (b) Applying orthogonal transformation at the last layer.
  • Figure 3: Representation similarity diminishes with layer-wise orthogonal constraints.
  • Figure 4: PCA distribution visualization. We compare the first two principal components of LINEAR, FULL, and FULL+RIFT($\lambda=1$) features across five medical image datasets. The first figure summarizes the distance and overlap of the pretrained model features. Darker colors indicate higher feature density, while lighter colors indicate lower density.
  • Figure 5: Attention heatmap visualization. We give more qualitative results on zero-shot natural image classification to further demonstrate the generalizability of RIFT and RIFT*. The red boxes highlight the regions most attended to by each method. Images are taken from previously unseen datasets: Oxford-IIIT Pet, Oxford Flowers, Stanford Cars, and CUB-200.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Definition 3.2: Similarity-Invariant Parameter Subspace
  • Definition A3.1: Multi-layer linear network with layerwise orthogonal rotations
  • Proposition A3.3: Informal
  • Theorem A3.5
  • proof
  • Remark A3.6: Interpretation and Significance