SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning

Haiwen Diao, Bo Wan, Xu Jia, Yunzhi Zhuge, Ying Zhang, Huchuan Lu, Long Chen

TL;DR

SHERL tackles memory-heavy fine-tuning of large pre-trained models by decoupling adaptation into two stages: an early consolidation stage that mitigates cross-layer redundancy, and a late regulation stage that reuses a minimal number of deep pre-trained layers. Both stages are realized by the Multi-Tiered Sensing Adapter (MTSA), which is detached from the backbone and orchestrates a two-route transfer: an anti-redundancy, cross-layer feature aggregation, followed by a regulation step that aligns the aggregated features with pre-trained knowledge. Across vision-language and NLP tasks, SHERL delivers strong accuracy-memory trade-offs, outperforming memory-efficient and parameter-efficient baselines under comparable budgets and showing broad compatibility with Transformer, CNN, and Encoder-Decoder backbones. The approach is validated through extensive ablations and demonstrations of interoperability with existing PETL methods, highlighting its practical potential for resource-limited deployment of large foundation models.

Abstract

Parameter-efficient transfer learning (PETL) has emerged as a flourishing research field for adapting large pre-trained models to downstream tasks, greatly reducing trainable parameters while still grappling with memory challenges during fine-tuning. To address this, memory-efficient transfer learning (METL) methods avoid backpropagating gradients through the large backbone. However, they compromise by relying exclusively on frozen intermediate outputs, which limits the exploration of prior knowledge from pre-trained models. Moreover, the dependency and redundancy between cross-layer features are frequently overlooked, which submerges more discriminative representations and causes an inherent performance gap relative to conventional PETL methods. Hence, we propose an innovative METL strategy called SHERL for resource-limited scenarios, which decouples the entire adaptation into two successive and complementary processes. In the early route, intermediate outputs are consolidated via an anti-redundancy operation, enhancing their compatibility for subsequent interactions; in the late route, utilizing minimal late pre-trained layers alleviates the peak demand on memory overhead and regulates these fairly flexible features into more adaptive and powerful representations for new domains. Extensive ablations on vision-and-language and language-only tasks show that SHERL combines the strengths of both parameter-efficient and memory-efficient techniques, performing on par or better across diverse architectures with lower memory during fine-tuning. Our code is publicly available at: https://github.com/Paranioar/SHERL.
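
The two-route idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not specify the anti-redundancy operation, so the similarity-based weighting below is an assumption, and the names `anti_redundancy_weights`, `consolidate`, and `late_regulation` are hypothetical. Gradient flow is not modeled; the sketch only shows how features from several frozen layers might be fused before passing through one late frozen layer.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def anti_redundancy_weights(layer_feats):
    """Assumed weighting (not from the paper): a layer whose features are,
    on average, more similar to the other layers receives a smaller weight."""
    n = len(layer_feats)
    if n < 2:
        raise ValueError("need features from at least two layers")
    redundancy = []
    for i in range(n):
        sims = [cosine(layer_feats[i], layer_feats[j]) for j in range(n) if j != i]
        redundancy.append(sum(sims) / len(sims))
    # Softmax over negative redundancy, so the weights sum to one.
    exps = [math.exp(-r) for r in redundancy]
    total = sum(exps)
    return [e / total for e in exps]


def consolidate(layer_feats):
    """Early route: fuse intermediate outputs into a single feature vector."""
    weights = anti_redundancy_weights(layer_feats)
    dim = len(layer_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, layer_feats)) for d in range(dim)]


def late_regulation(fused, frozen_late_layer):
    """Late route: pass the fused features through a frozen late layer."""
    return frozen_late_layer(fused)


# Toy usage: two identical (redundant) layer outputs and one distinct output.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = anti_redundancy_weights(feats)
fused = consolidate(feats)
# A stand-in for a frozen pre-trained layer (hypothetical).
output = late_regulation(fused, lambda x: [math.tanh(v) for v in x])
```

In this toy input the third layer is the least redundant, so it receives the largest weight, while the two identical layers share equal, smaller weights.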

Paper Structure

This paper contains 16 sections, 3 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Overview of (a) parameter-efficient Partially Tuning, Adapter Tuning, Prompt Tuning; and (b) memory-efficient Side-Tuning, Ladder Side Tuning (LST), Universal Parallel Tuning (UniPT). Red dotted lines denote the backward gradients.
  • Figure 2: Overview of the framework, which mitigates the disparity and redundancy of intermediate features across shallow layers and thereby generates adaptive, compatible inputs for the subsequent deep layers, which regulate the features when transferring to new domains.
  • Figure 3: Overview of applying SHERL to (a) single- or cross-modality Transformer, (b) CNN, and (c) T5 or MDETR-like Encoder-Decoder architectures. The pre-trained base backbone and our proposed SHERL module are denoted as $\phi$ and $\varphi$, respectively.
  • Figure 4: Average accuracy (Ave. %) on various VL datasets with different aggregation strategies for early intermediate features.
  • Figure 5: Average accuracy (Ave. %) on GLUE benchmark by involving popular PETL methods in late feature regulation.
  • ...and 6 more figures