ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections

Massimo Bini; Karsten Roth; Zeynep Akata; Anna Khoreva

TL;DR

This work introduces ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning.

Abstract

Parameter-efficient finetuning (PEFT) has become ubiquitous for adapting foundation models to downstream task requirements while retaining their generalization ability. However, the number of additionally introduced parameters and the compute needed for successful adaptation and hyperparameter searches can explode quickly, especially when deployed at scale to serve numerous individual requests. To ensure effective, parameter-efficient, and hyperparameter-robust adaptation, we propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters ($\sim 10$-$100\times$ fewer than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning. Finally, we investigate the recent emphasis on Hyperspherical Energy retention for adaptation and raise questions on its practical utility. The code is available at https://github.com/mwbini/ether.
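
To make the idea of a hyperplane reflection concrete, the following is a minimal sketch of the standard Householder reflection $H = I - 2\hat{u}\hat{u}^\top$ in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`householder`, `matvec`, `distance_to_identity`) and the example vectors are hypothetical, and no finetuning loop is shown. It checks two properties of such a reflection: it preserves vector length, and its Frobenius distance to the identity is fixed at 2 regardless of the chosen normal.

```python
import math

def householder(u):
    """Build H = I - 2 * u_hat u_hat^T; u is normalized internally (must be nonzero)."""
    norm = math.sqrt(sum(x * x for x in u))
    u_hat = [x / norm for x in u]
    d = len(u_hat)
    return [[(1.0 if i == j else 0.0) - 2.0 * u_hat[i] * u_hat[j]
             for j in range(d)] for i in range(d)]

def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def distance_to_identity(m):
    """Frobenius norm of (m - I)."""
    d = len(m)
    return math.sqrt(sum((m[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(d) for j in range(d)))

# Hypothetical example values: u plays the role of the learnable normal.
u = [1.0, 2.0, 2.0]
x = [3.0, -1.0, 0.5]

H = householder(u)
y = matvec(H, x)

norm_x = math.sqrt(sum(v * v for v in x))
norm_y = math.sqrt(sum(v * v for v in y))
dist = distance_to_identity(H)
x_back = matvec(H, y)  # reflecting twice should return the original vector
```

Only the $d$ entries of `u` would need to be learned in this sketch, which is one way to see why a reflection-based transformation can be parameterized so compactly.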

Paper Structure (32 sections, 11 equations, 11 figures, 12 tables)

This paper contains 32 sections, 11 equations, 11 figures, 12 tables.

Introduction
Related Work
Method
Preliminaries
Parameter-Efficient Finetuning with Adapters.
Orthogonal Finetuning (OFT).
ETHER: Finetuning with Hyperplane Reflections
Relaxing Orthogonality in ETHER
Efficient ETHER through Block-Parallelism
Intriguing Properties of ETHER
Benchmark Experiments
ETHER for Image-generative Model Adaptation
Subject-driven Generation
Controllable Image Generation
ETHER for Language Models Adaptation
...and 17 more sections

Figures (11)

Figure 1: ETHER and ETHER+ sketches. We visualize either a single hyperplane reflection for ETHER or two interacting hyperplanes for ETHER+, parametrized by unit normals $u$ (and $v$). Unlike ETHER, the final result of ETHER+ does not have to retain the original length $L$, as the need for hard reflections is softened, and orthogonality is no longer guaranteed.
Figure 2: Block-Parallel Computation scheme between a $d$-dimensional block-diagonal transformation with $n$ blocks and a $d \times f$-dimensional weight matrix $W$.
Figure 3: Change in model behavior as a function of perturbation strength, i.e. the distance between the weight transformation and the identity matrix. As ETHER and ETHER+ are upper-bounded in perturbation by construction, catastrophic deterioration of model performance is rarely encountered, and weight transformations remain controllable even for maximal deviations. For standard approaches such as OFT, larger deviations from the identity matrix may occur during training and result in substantial divergence from the pretrained model. Notice also that by breaking orthogonality constraints in ETHER+, both smaller and stronger semantic variants can be learned.
Figure 4: Distances as a function of learning rate between the transformation and the identity matrix (Transformation Distance), and between the finetuned and pretrained weights (Weights Distance). Distances are measured at convergence (1200 iterations) for subject-driven generation finetuning. In both cases, distances for non-ETHER methods are orders of magnitude higher and unbounded as the learning rate increases.
Figure 5: mIoU and FID performance as a function of learning rate. Results are obtained for controllable generation S2I finetuning on Stable Diffusion and reveal a much stronger learning rate robustness of ETHER-based methods, which retain strong performance across the entire range of learning rate magnitudes.
...and 6 more figures
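
The block-parallel scheme described in the Figure 2 caption can be illustrated with a small sketch in plain Python. This is an assumption-laden illustration, not the paper's code: the helper names (`matmul`, `block_diag`, `block_parallel_apply`) and the toy values are hypothetical, and the blocks are processed in a simple loop, whereas a practical implementation would presumably batch them. The sketch shows that applying each of the $n$ blocks to its own slice of rows of $W$ gives the same result as multiplying by the full $d \times d$ block-diagonal matrix, without ever materializing that matrix.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def block_diag(blocks):
    """Assemble square blocks into one dense block-diagonal matrix."""
    d = sum(len(b) for b in blocks)
    out = [[0.0] * d for _ in range(d)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, val in enumerate(row):
                out[offset + i][offset + j] = val
        offset += len(b)
    return out

def block_parallel_apply(blocks, w):
    """Apply each block to its own row slice of w and stack the results."""
    out, offset = [], 0
    for b in blocks:
        k = len(b)
        out.extend(matmul(b, w[offset:offset + k]))
        offset += k
    return out

# Toy setup (hypothetical values): n = 2 blocks of size 2, so d = 4, and f = 3.
# Both blocks are 2x2 Householder reflections.
blocks = [
    [[0.0, 1.0], [1.0, 0.0]],   # reflection with normal (1, -1) / sqrt(2)
    [[1.0, 0.0], [0.0, -1.0]],  # reflection with normal (0, 1)
]
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0],
     [10.0, 11.0, 12.0]]

dense = matmul(block_diag(blocks), W)    # multiplies by the full d x d matrix
fast = block_parallel_apply(blocks, W)   # touches only the n small blocks
```

In this sketch the dense route multiplies a $d \times d$ matrix that is mostly zeros, while the block route performs $n$ products with $(d/n) \times (d/n)$ blocks, and the two results agree.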
