Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning

Zichen Tian; Yaoyao Liu; Qianru Sun

TL;DR

MetaPEFT is a method that uses adaptive scalers to dynamically adjust the influence of PEFT modules during fine-tuning. It achieves state-of-the-art performance in cross-spectral adaptation while requiring only a small number of trainable parameters and significantly improving tail-class accuracy.

Abstract

Training large foundation models from scratch for domain-specific applications is almost impossible due to data limits and long-tailed distributions; remote sensing (RS) is one example. Fine-tuning natural image pre-trained models on RS images is a straightforward solution. To reduce computational costs and improve performance on tail classes, existing methods apply parameter-efficient fine-tuning (PEFT) techniques, such as LoRA and AdaptFormer. However, we observe that fixed hyperparameters (such as intra-layer positions, layer depth, and scaling factors) can considerably hinder PEFT performance, as fine-tuning on RS images proves highly sensitive to these settings. To address this, we propose MetaPEFT, a method incorporating adaptive scalers that dynamically adjust module influence during fine-tuning. MetaPEFT dynamically adjusts three key factors of PEFT on RS images: module insertion, layer selection, and module-wise learning rates, which collectively control the influence of PEFT modules across the network. We conduct extensive experiments on three transfer-learning scenarios and five datasets in both RS and natural image domains. The results show that MetaPEFT achieves state-of-the-art performance in cross-spectral adaptation, requiring only a small number of trainable parameters and significantly improving tail-class accuracy.
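
As a rough sketch of how such a scaler could enter an additive PEFT module and how it could be learned, the formulation below uses LoRA as the example. The symbols $\gamma$, $\phi$, $A$, $B$, $\mathcal{D}_\mathrm{train}$, and $\mathcal{D}_\mathrm{val}$ follow the Figure 2 caption; $W_0$ (the frozen pre-trained weight), $x$, $h$, and the loss $\mathcal{L}$ are notation assumed here, and the exact placement of $\gamma$ in the paper may differ.

$$
h = W_0 x + \gamma \, B A x, \qquad \phi = \{A, B\}
$$

$$
\phi^{*}(\gamma) = \arg\min_{\phi} \; \mathcal{L}(\phi, \gamma; \mathcal{D}_\mathrm{train}), \qquad
\gamma^{*} = \arg\min_{\gamma} \; \mathcal{L}(\phi^{*}(\gamma), \gamma; \mathcal{D}_\mathrm{val})
$$

Under this reading, setting $\gamma = 0$ would switch a module off entirely, which is one way a single scalar per module could cover insertion, layer selection, and influence at once.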

Paper Structure (31 sections, 14 equations, 5 figures, 13 tables)

Introduction
Related Works
Method
PEFT and Its Hyperparameters
Limitations of Manual Optimization
Auto Optimization via Meta-Learning
Unified Modulator
Bi-Level Optimization Framework
Experiments
Model Adaptation Scenarios
Implementation Details
Ablation Studies and Method Comparisons
Conclusions
Theoretical Foundations
Equivalence and Advantages of Optimizing Scaling Factors rather than Learning Rate
...and 16 more sections

Figures (5)

Figure 1: Comparing PEFT methods for the model adaptation of IN21K $\to$ DOTA. (a) Bubble plot of overall accuracy, tail-class accuracy, and performance variance (in standard deviation) of 5 PEFT methods (6 versions in total). Additive methods exhibit consistently higher accuracy and lower variance than non-additive methods. (b) Inter-class feature distances of the PEFT methods measured by cosine similarity. Additive methods achieve 13% larger feature distances, which means better discrimination among tail classes, with comparable head-class distances. (c) Accuracy heatmap for applying PEFT at different positions in ViT: intra-block layers vs. attention blocks (depth). Deeper blocks yield better performance (86.5% to 90.4%), but the combination of optimal block and intra-block position shows unexpected degradation (0.6% drop for the FFN layer with depth 10$\to$11). The optimal combination is marked in the heatmap. (d) Accuracy heatmap of intra-block positions vs. scaling factors. Different positions show distinct sensitivity to scaling factors, with sharp accuracy drops observed (e.g., accuracy for PEFT applied on the attention output layer, denoted as Out, drops from 87% to 6.7% when its PEFT scaling factor increases from 2 to 4). These results highlight the non-monotonic complexity of PEFT hyperparameters.
Figure 2: Architecture and optimization framework for MetaPEFT. In Figures (a)-(c), we illustrate how our proposed modulator $\gamma$ is integrated with three representative additive PEFT methods: (a) AdaptFormer with modulated up/down projections ($W_\mathrm{up}$/$W_\mathrm{down}$), (b) LoRA with modulated low-rank decomposition matrices ($A$/$B$), and (c) Adapter with modulated projection layers ($W_\mathrm{up}$/$W_\mathrm{down}$). The $\sigma$ denotes the non-linear activation function (e.g., ReLU). In Figure (d), we show our bi-level optimization framework. The inner loop optimizes PEFT parameters $\phi$ on training data $\mathcal{D}_\mathrm{train}$, and the outer loop updates modulator $\gamma$ on validation data $\mathcal{D}_\mathrm{val}$ sampled from the training set. The symbol $N\times$ indicates the operation is repeated for $N$ attention blocks (e.g., $N = 12$ for ViT-B/16).
Figure S1: Accuracy heatmap for head/med/tail/overall classes. Visualization of accuracy heatmaps across intra-block layers vs. blocks (depth), where a block is an attention block of ViT. Results are reported on the transfer scenario IN21K$\to$DOTA. The optimum in each heatmap is highlighted by a yellow box. Results show that: 1) tail classes in (c) exhibit non-monotonic accuracy changes across positions, while head classes in (a) and medium classes in (b) show monotonic trends; 2) the optimal configuration in (c) determines the overall optimal configuration in (d), indicating that tail-class performance dominates the model's overall performance. These findings validate our position-aware optimization strategy for long-tailed datasets.
Figure S2: Linear relationship between rank and optimal scaling factor. The heatmaps show accuracy distributions across different ranks and scaling factors for (a) tail classes and (b) all classes. Results are reported on IN21K$\to$CIFAR100. Yellow highlights indicate the optimal scaling factor for each rank (rank 32 in Figure (b) has two equally good optima). We observe a consistent linear relationship, i.e., optimal scaling factor $\propto$ rank. This pattern suggests that rank optimization can be decoupled from the remaining hyperparameters.
Figure S3: t-SNE Visualization of (a) LoRA and (b) LoRA + Ours.
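
The bi-level scheme described in the Figure 2 caption (inner loop on PEFT parameters $\phi$, outer loop on the modulator $\gamma$ using validation data held out from the training set) can be illustrated on a toy scalar problem. This is a minimal sketch, not the authors' implementation: the function names (`predict`, `mse`, `grads`, `bilevel_fit`), the rank-1 scalar "adapter", the step sizes, and the first-order outer update (which ignores how $\phi$ depends on $\gamma$) are all assumptions made for illustration.

```python
import random


def predict(x, w0, a, b, gamma):
    # Frozen weight w0 plus a rank-1 "adapter" b*a, scaled by the modulator gamma.
    return (w0 + gamma * b * a) * x


def mse(data, w0, a, b, gamma):
    # Mean squared error over a list of (x, y) pairs.
    return sum((predict(x, w0, a, b, gamma) - y) ** 2 for x, y in data) / len(data)


def grads(data, w0, a, b, gamma):
    # Analytic gradients of the MSE with respect to a, b, and gamma.
    ga = gb = gg = 0.0
    for x, y in data:
        err = 2.0 * (predict(x, w0, a, b, gamma) - y) * x / len(data)
        ga += err * gamma * b
        gb += err * gamma * a
        gg += err * b * a
    return ga, gb, gg


def bilevel_fit(train, val, w0, outer_steps=200, inner_steps=5,
                lr_phi=0.05, lr_gamma=0.05):
    # phi = (a, b) plays the role of the PEFT parameters; gamma is the modulator.
    a, b, gamma = 0.5, 0.5, 1.0
    for _ in range(outer_steps):
        # Inner loop: update phi on the training split with gamma held fixed.
        for _ in range(inner_steps):
            ga, gb, _ = grads(train, w0, a, b, gamma)
            a -= lr_phi * ga
            b -= lr_phi * gb
        # Outer step: update gamma on the validation split with phi held fixed
        # (first-order approximation, an assumption of this sketch).
        _, _, gg = grads(val, w0, a, b, gamma)
        gamma -= lr_gamma * gg
    return a, b, gamma


random.seed(0)
w0, w_target = 1.0, 3.0  # frozen weight and the weight the toy task needs
points = [(x, w_target * x) for x in (random.uniform(-1, 1) for _ in range(40))]
train, val = points[:30], points[30:]  # validation split held out from training data
a, b, gamma = bilevel_fit(train, val, w0)
print("effective weight:", w0 + gamma * b * a)
print("validation MSE:", mse(val, w0, a, b, gamma))
```

In a real ViT there would be one such modulator per PEFT module, and the inner and outer losses would be classification losses rather than a scalar regression; the alternating structure is the only part this sketch is meant to convey.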
