ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts

Samar Khanna, Medhanie Irgau, David B. Lobell, Stefano Ermon

TL;DR

ExPLoRA tackles the cost of adapting vision transformers to new domains. Rather than pre-training from scratch, it extends unsupervised pre-training on the target domain, starting from a ViT pre-trained on natural images. The scheme is parameter-efficient: it unfreezes 1-2 blocks and applies LoRA to the remaining layers, producing $W_T^* = W_S + \Delta_T$, and then performs supervised fine-tuning with a small parameter footprint. Empirically, ExPLoRA sets new state-of-the-art results on satellite imagery and performs well on diverse WILDS domains, while reducing pre-training compute by up to 8x and trainable parameters by up to 16x compared with full pre-training on the target domain. This makes domain-adapted foundation models more accessible in resource-constrained settings and suggests applicability beyond satellite imagery.
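For the LoRA-tuned layers, the update $W_T^* = W_S + \Delta_T$ can be illustrated on a single weight matrix. The sketch below is a minimal, dependency-free illustration that assumes the standard LoRA parameterization $\Delta = \frac{\alpha}{r} B A$ with $B$ initialized to zero. The function names, the scaling factor, and the toy dimensions are illustrative assumptions and are not taken from the paper's code.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def init_lora(d_out, d_in, rank, seed=0):
    """Assumed standard LoRA init: A is small random, B is zero, so the update starts at 0."""
    rng = random.Random(seed)
    lora_a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    lora_b = [[0.0] * rank for _ in range(d_out)]
    return lora_b, lora_a

def lora_merge(w_s, lora_b, lora_a, alpha, rank):
    """Return W_S + (alpha / rank) * B @ A without modifying W_S."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * dlt for w, dlt in zip(w_row, d_row)]
            for w_row, d_row in zip(w_s, delta)]

# Toy frozen weight W_S (8 x 8 identity) and a rank-2 adapter.
d, r = 8, 2
w_s = [[float(i == j) for j in range(d)] for i in range(d)]
lora_b, lora_a = init_lora(d, d, r)

# Before any training step, the merged weight equals W_S exactly.
w_init = lora_merge(w_s, lora_b, lora_a, alpha=2.0, rank=r)

# Stand-in for a training update: change one entry of B.
# The merged weight now differs from W_S by a matrix of rank at most r.
lora_b[0][0] = 1.0
w_t = lora_merge(w_s, lora_b, lora_a, alpha=2.0, rank=r)

trainable = r * (d + d)  # entries in A and B
full = d * d             # entries in W_S
```

The savings only become meaningful when the rank is much smaller than the layer width; the toy sizes here are chosen for readability, not realism.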

Abstract

Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question in PEFT is whether the pre-training phase can be extended without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with weights pre-trained on large natural-image datasets, such as those from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-training and fine-tuning ViTs. Using the DinoV2 training objective, we demonstrate up to 8% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines such as PEFT. Code is available on the project website: https://samar-khanna.github.io/ExPLoRA/

Paper Structure

This paper contains 53 sections, 6 equations, 11 figures, 20 tables, 1 algorithm.

Figures (11)

  • Figure 1: Consider two different image domains, $S$ and $T$. Above: the traditional paradigm of pre-training from scratch on each domain $S$, $T$ to yield $W_{S}$, $W_{T}$, and then fine-tuning on target datasets $i$ to yield $\Delta^{(i)}_{s}, \Delta^{(i)}_{t}$. Below: our approach, which is to initialize with pre-trained weights from $S$ and then learn unsupervised weights $\Delta_{T}$ for $T$ in a parameter-efficient manner.
  • Figure 2: An overview of ExPLoRA. The set ${\mathcal{L}}$ of $L$ ViT blocks is partitioned into two sets: ${\mathcal{U}}$ (red), which denotes blocks whose parameters are completely unfrozen, and ${\mathcal{L}} \setminus {\mathcal{U}}$ (blue), which denotes blocks that undergo LoRA tuning (only on the $Q, V$ attention matrices). Note that the normalization layers are always unfrozen across all blocks.
  • Figure 3: The mean of the eigenvalues of the feature map output by each ViT block.
  • Figure 4: The variance of the eigenvalues of the feature map output by each ViT block.
  • Figure 5: Linear probing patches for position (local information), across all ViT blocks.
  • ...and 6 more figures
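
The partition described in the Figure 2 caption can be turned into a rough count of trainable parameters. The sketch below is a back-of-the-envelope estimate under stated assumptions: a standard pre-norm ViT block layout, the commonly cited ViT-L sizes (24 blocks, width 1024, MLP ratio 4), and a hypothetical choice of unfrozen block and LoRA rank. It counts only parameters inside the transformer blocks and ignores patch and position embeddings, the final norm, and any pre-training heads or decoders. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViTConfig:
    """Hypothetical config; defaults follow the commonly cited ViT-L layout."""
    num_blocks: int = 24
    dim: int = 1024
    mlp_ratio: int = 4

def block_param_counts(cfg):
    """Per-block parameter counts (weights + biases) for a standard ViT block."""
    d, h = cfg.dim, cfg.dim * cfg.mlp_ratio
    attn = 3 * (d * d + d) + (d * d + d)  # Q, K, V projections + output projection
    mlp = (d * h + h) + (h * d + d)       # two linear layers
    norms = 2 * (2 * d)                   # two LayerNorms, scale and shift each
    return {"attn": attn, "mlp": mlp, "norms": norms}

def explora_trainable_params(cfg, unfrozen, lora_rank):
    """Count trainable block parameters.

    Blocks whose index is in `unfrozen` (the set U) train all parameters.
    Every other block trains LoRA factors on Q and V plus its norm layers.
    Returns (trainable, total) over the transformer blocks only.
    """
    counts = block_param_counts(cfg)
    full_block = sum(counts.values())
    lora_qv = 2 * (2 * cfg.dim * lora_rank)  # A and B factors for each of Q and V
    trainable = 0
    for idx in range(cfg.num_blocks):
        if idx in unfrozen:
            trainable += full_block
        else:
            trainable += lora_qv + counts["norms"]
    return trainable, full_block * cfg.num_blocks

# Assumed example: unfreeze only the last block (index 23) and use LoRA rank 64.
trainable, total = explora_trainable_params(ViTConfig(), unfrozen={23}, lora_rank=64)
fraction = trainable / total
```

Changing `unfrozen` or `lora_rank` shows how the trainable fraction trades off between fully unfrozen blocks and low-rank adapters; the specific values above are placeholders rather than the paper's reported configuration.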