Split Adaptation for Pre-trained Vision Transformers

Lixu Wang, Bingqi Shang, Yi Li, Payal Mohapatra, Wei Dong, Xiao Wang, Qi Zhu

TL;DR

This work proposes split adaptation (SA), a method inspired by split learning that targets the challenging few-shot adaptation setting, uses patch retrieval augmentation to alleviate overfitting, and outperforms state-of-the-art methods in the reported experiments.

Abstract

Vision Transformers (ViTs), extensively pre-trained on large-scale datasets, have become essential to foundation models, allowing excellent performance on diverse downstream tasks with minimal adaptation. Consequently, there is growing interest in adapting pre-trained ViTs across various fields, including privacy-sensitive domains where clients are often reluctant to share their data. Existing adaptation methods typically require direct data access, rendering them infeasible under these constraints. A straightforward solution would be to send the pre-trained ViT to clients for local adaptation, but this raises model intellectual property concerns and incurs heavy client computation overhead. To address these issues, we propose a novel split adaptation (SA) method that enables effective downstream adaptation while protecting both data and models. SA, inspired by split learning (SL), segments the pre-trained ViT into a frontend and a backend, with only the frontend shared with the client for data representation extraction. Unlike regular SL, however, SA replaces the frontend parameters with low-bit quantized values, preventing direct exposure of the model. SA also lets the client add bi-level noise, to both the frontend and the extracted data representations, to protect its data. Accordingly, SA incorporates data-level and model-level out-of-distribution enhancements to mitigate the impact of noise injection on adaptation performance. SA focuses on the challenging few-shot adaptation setting and adopts patch retrieval augmentation to alleviate overfitting. Extensive experiments on multiple datasets validate SA's superiority over state-of-the-art methods and demonstrate its defense against advanced data reconstruction attacks, while preventing model leakage with minimal computation cost on the client side. The source code is available at https://github.com/conditionWang/Split_Adaptation.
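The data flow described in the abstract (quantized frontend sent to the client, bi-level noise on the client, backend kept on the server) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single linear layers standing in for the ViT frontend and backend, the uniform min-max quantizer, the helper names `quantize_uniform` and `add_laplace_noise`, and all noise scales are assumptions chosen only for illustration.

```python
import numpy as np

def quantize_uniform(w, bits=4):
    """Uniform min-max quantization to 2**bits levels (assumed scheme), de-quantized."""
    lo, hi = float(w.min()), float(w.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((w - lo) / scale)          # integer codes in [0, levels]
    return q * scale + lo                   # low-bit values mapped back to floats

def add_laplace_noise(x, scale, rng):
    """Add zero-mean Laplace noise with the given scale."""
    return x + rng.laplace(loc=0.0, scale=scale, size=x.shape)

rng = np.random.default_rng(0)

# Hypothetical stand-ins: one linear layer each for frontend and backend.
W_front = rng.normal(size=(16, 8))
W_back = rng.normal(size=(8, 3))

# Server side: only a low-bit version of the frontend leaves the server.
W_q = quantize_uniform(W_front, bits=4)

# Client side: noise level 1 on the received frontend weights ...
W_client = add_laplace_noise(W_q, scale=0.01, rng=rng)
x = rng.normal(size=(5, 16))                # five private client samples
reps = np.tanh(x @ W_client)
# ... and noise level 2 on the extracted representations before upload.
reps_noisy = add_laplace_noise(reps, scale=0.05, rng=rng)

# Server side: the backend only ever sees the noisy representations.
logits = reps_noisy @ W_back
print(logits.shape)
```

In this sketch the raw samples `x` and the full-precision weights `W_front` never cross the client/server boundary; only `W_q` and `reps_noisy` do.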

Paper Structure

This paper contains 27 sections, 14 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: The comparison between different downstream task adaptation approaches. Our SA can achieve effective few-shot task adaptation with minimal computation cost on the client side while protecting both the model and data.
  • Figure 2: Overview of Split Adaptation (SA) for pre-trained ViT adaptation. SA divides the pre-trained ViT into a frontend and a backend. After Out-of-distribution Enhanced Quantization is applied to the frontend, its quantized version is sent to the client. To mitigate the impact of quantization, SA adopts Out-of-distribution Quantization-aware Tuning to enhance the backend's generalization. The client injects random noise into the received frontend, then retrieves and replaces randomly selected patches to augment the client data representations. These representations are perturbed with noise again and sent to the server for the final adaptation.
  • Figure 2: Comparison of client memory (MB) and computation costs (Min) between SA and other baselines.
  • Figure 3: Visualization of our Hilbert Transform data augmentation. Compared to amplitude-exchange augmentation [wang2022domain] (bottom row), our augmentation method (top row) generates data that diverges further from the original in appearance while preserving the original semantics (dog category).
  • Figure 4: Sensitivity analysis of bi-level noisy representation extraction, varying the degree of the injected Laplace noise across different values of z.
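
The patch retrieval augmentation mentioned in the Figure 2 caption (randomly selected patches are retrieved and replaced to produce additional client representations) could look roughly like the sketch below. The captions do not state the retrieval criterion, so nearest-neighbour retrieval by cosine similarity from other samples is an assumption here, as are the function name `patch_retrieval_augment` and the `replace_ratio` parameter.

```python
import numpy as np

def patch_retrieval_augment(tokens, replace_ratio=0.25, rng=None):
    """Replace random patch tokens with their most similar patch from other samples.

    tokens: array of shape (num_samples, num_patches, dim), num_samples >= 2.
    Retrieval by cosine similarity is an assumed criterion, not the paper's.
    """
    if rng is None:
        rng = np.random.default_rng()
    n, p, d = tokens.shape
    flat = tokens.reshape(n * p, d)
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
    owner = np.repeat(np.arange(n), p)       # sample index of each flat patch
    out = tokens.copy()
    k = max(1, int(round(replace_ratio * p)))
    for i in range(n):
        positions = rng.choice(p, size=k, replace=False)
        candidates = np.flatnonzero(owner != i)   # patches from other samples only
        for j in positions:
            sims = unit[candidates] @ unit[i * p + j]
            out[i, j] = flat[candidates[np.argmax(sims)]]
    return out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 6, 8))          # 4 samples, 6 patches, 8-dim tokens
augmented = patch_retrieval_augment(tokens, replace_ratio=0.5, rng=rng)
changed_per_sample = np.any(augmented != tokens, axis=2).sum(axis=1)
print(changed_per_sample)
```

Each call yields one augmented copy per sample; calling it repeatedly with different random positions would give several augmented representations from a few-shot set.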