Minimal Interaction Separated Tuning: A New Paradigm for Visual Adaptation

Ningyuan Tang, Minghao Fu, Jianxin Wu

TL;DR

This work introduces Minimal Interaction Separated Tuning (MIST), a paradigm that enables fine-tuning of large vision models on low-resource devices by offloading backbone computation to the cloud and transferring a highly compressed, information-rich aggregate of intermediate features to an edge adaptor. By summing intermediate features as Gather^*(B,x,k) and pairing with a low-rank attention edge network (LAE), MIST achieves low communication, memory, and compute footprints while maintaining competitive accuracy across diverse vision tasks. Extensive VTAB-1K, few-shot, and domain-generalization experiments demonstrate that MIST matches or exceeds prior parameter-efficient methods, with superior edge-device efficiency and robustness to domain shifts; segmentation results on ADE-20k further indicate applicability beyond recognition. The approach is reinforced by ablations showing the necessity of both Gather^* and LAE, and by analyses of how the hyperparameter k and edge-tuning structure influence performance. Overall, MIST offers a practical pathway to deploy large pretrained vision models on edge devices with minimal interaction and sustained performance gains.

Abstract

The rapid scaling of large pretrained vision models makes fine-tuning increasingly difficult on devices with low computational resources. We explore a new visual adaptation paradigm called separated tuning, which treats large pretrained models as standalone feature extractors that run on powerful cloud servers, while fine-tuning is carried out on devices that possess only low computational resources (slow CPU, no GPU, small memory, etc.). We discuss existing methods that are potentially suitable for our separated tuning paradigm, but three major drawbacks hinder their application in separated tuning: low adaptation capability, large adaptor networks, and in particular, high information transfer overhead. To address these issues, we propose Minimal Interaction Separated Tuning, or MIST, which reveals that the sum of intermediate features from pretrained models not only requires minimal information transfer but also has high adaptation capability. With a lightweight attention-based adaptor network, MIST achieves information transfer efficiency, parameter efficiency, and computational and memory efficiency, while demonstrating competitive results on various visual adaptation benchmarks.
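The core idea above — summing intermediate features so that only a single feature-sized tensor crosses the cloud-to-edge link — can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the function name `gather_star` and the "last $k$ blocks" selection are assumptions; the paper defines the exact aggregation via its hyperparameter $k$.

```python
import numpy as np

def gather_star(feats, k):
    """Sketch of Gather^*(B, x, k): elementwise sum of the last k
    intermediate features produced by the pretrained backbone B on
    input x. Only this single summed tensor is transferred to the
    edge device, so the communication cost equals one feature map
    regardless of backbone depth. (Name and "last k" convention are
    assumptions for illustration.)
    """
    return np.sum(np.stack(feats[-k:], axis=0), axis=0)

# toy example: a 12-block transformer, 197 tokens of dimension 768
feats = [np.random.randn(197, 768) for _ in range(12)]
z = gather_star(feats, k=4)
print(z.shape)  # (197, 768) -- same size as a single block's output
```

Note how the transfer size stays constant as $k$ grows, which is what distinguishes this aggregate from the block-wise concatenation used by ladder side-tuning.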

Paper Structure

This paper contains 27 sections, 10 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Illustration of our proposed Separated Tuning. Visual samples for different tasks ($x_1,x_2,x_3,\dots$) are collected and transferred to the cloud server. Within the cloud server, a large pretrained model $B$ acts as a standalone feature extractor, producing intermediate feature sets $(B,x_i)$. A gather function compresses this set to $\mathrm{Gather}(B,x_i)$ to achieve minimal interaction while still keeping the essential information in $(B,x_i)$ for downstream task learning. The gathered features for task $i$ are sent to a low-resource device $E_i$, where fine-tuning is performed solely on $E_i$ with input $\mathrm{Gather}(B,x_i)$. During inference, the pretrained model extracts features, gathers them to a minimal level, and transfers only a small chunk of bytes to the low-resource device, which then makes decisions with small storage, computational, and memory costs.
  • Figure 2: Architecture of a typical ladder side tuning network (Sung et al., 2022), where intermediate features $z_i$ from the pretrained model $B$ are added to side blocks through side paths. The interaction (features to be transferred) $(B,x)$ is the concatenation of all $z_i$ ($i=0,1,\dots,N$).
  • Figure 3: Illustration of MIST and other fine-tuning methods. The Adapter method fine-tunes a model by inserting trainable adapter blocks. LoRA adds trainable low-rank decomposed matrices to parameters. Side-tuning trains a decoupled side network by passing intermediate features to the side network block-wise. Our MIST compresses intermediate features with the gather function $\mathrm{Gather^*}$ and passes the result to a trainable edge network that adapts to a specific task.
  • Figure 4: Top-1 accuracy on fine-grained few-shot datasets with train set containing 1, 2, 4, 8, 16-shot per class.
  • Figure 5: Ablation study on gather function with different $k$ on VTAB. Figures from left to right respectively refer to "Natural", "Specialized", and "Structured" task group.
  • ...and 1 more figure