Efficient Stitchable Task Adaptation

Haoyu He, Zizheng Pan, Jing Liu, Jianfei Cai, Bohan Zhuang

TL;DR

ESTA addresses the challenge of producing a diverse palette of task-adapted networks under varying resource budgets without incurring heavy memory or multi-stage adaptation costs. It combines Parameter-efficient Stitch Fine-tuning (PST) with low-rank updates in self-attention and stitching layers, plus stitch-specific bias terms, and introduces a one-stage deployment pipeline that uses SNIP-inspired stitch importance scores to guide sampling and deployment. The framework yields numerous ready-to-deploy stitches with improved Pareto frontiers and substantially reduced training time and trainable parameters, and scales to LLaMA-based stitching for instruction-following tasks, producing Stitched LLaMA models that interpolate between smaller and larger baselines. Overall, ESTA offers a scalable, efficient path to versatile deployment across vision and language models by combining lightweight adaptation, informed stitch selection, and one-stage integration of stitching and deployment.

Abstract

The paradigm of pre-training and fine-tuning has laid the foundation for deploying deep learning models. However, most fine-tuning methods are designed to meet a specific resource budget. To address diverse deployment scenarios with various resource budgets, SN-Net was recently introduced to quickly obtain numerous new networks (stitches) from the pre-trained models (anchors) in a model family via model stitching. Although promising, SN-Net faces new challenges when adapted to new target domains, including huge memory and storage requirements and a long, sub-optimal multi-stage adaptation process. In this work, we present a novel framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce a palette of fine-tuned models that adhere to diverse resource constraints. Specifically, we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches while maintaining independent bias terms. In this way, we largely reduce the fine-tuning memory burden and mitigate the interference among stitches that arises in task adaptation. Furthermore, we streamline a simple yet effective one-stage deployment pipeline, which uses training-time gradient statistics to estimate which stitches are important to deploy. By assigning higher sampling probabilities to important stitches, we also obtain a boosted Pareto frontier. Extensive experiments on 25 downstream visual recognition tasks demonstrate that ESTA generates stitches with smooth accuracy-efficiency trade-offs and surpasses direct SN-Net adaptation by remarkable margins with significantly lower training time and fewer trainable parameters. Furthermore, we demonstrate the flexibility and scalability of the ESTA framework by stitching LLMs from the LLaMA family, obtaining chatbot stitches of assorted sizes. Source code is available at https://github.com/ziplab/Stitched_LLaMA
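
The idea of sharing low-rank updates among stitches while keeping independent bias terms can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `SharedLoRALinear`, the helper `matvec`, the layer sizes, and the initialization scale are all hypothetical choices made for illustration, and plain Python lists stand in for tensors. The only parts taken from the abstract are the structure itself: a frozen pre-trained weight, one low-rank update shared by every stitch, and one bias vector per stitch.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class SharedLoRALinear:
    """Frozen weight W plus a low-rank update B @ A shared by all stitches,
    with one trainable bias vector per stitch (hypothetical layout)."""

    def __init__(self, d_in, d_out, rank, stitch_ids, seed=0):
        rng = random.Random(seed)
        # Frozen pre-trained weight (a random stand-in in this sketch).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Shared low-rank factors: A is (rank x d_in), B is (d_out x rank).
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, so the update B @ A is zero before training.
        self.B = [[0.0] * rank for _ in range(d_out)]
        # Stitch-specific biases: one vector per stitch.
        self.bias = {s: [0.0] * d_out for s in stitch_ids}

    def forward(self, x, stitch_id):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # shared low-rank path
        b = self.bias[stitch_id]                    # per-stitch bias
        return [u + v + w for u, v, w in zip(base, delta, b)]

    def num_trainable(self):
        """Count trainable scalars: shared A and B plus all per-stitch biases."""
        lora = sum(len(r) for r in self.A) + sum(len(r) for r in self.B)
        return lora + sum(len(b) for b in self.bias.values())


layer = SharedLoRALinear(d_in=8, d_out=8, rank=2, stitch_ids=range(4))
x = [1.0] * 8
y_stitch0 = layer.forward(x, stitch_id=0)
layer.bias[1][0] = 0.5  # only stitch 1 is affected by this change
y_stitch1 = layer.forward(x, stitch_id=1)
print(layer.num_trainable())
```

In this toy setting the four stitches share 32 low-rank parameters and add 8 bias parameters each, whereas giving each stitch its own full copy of the 8 x 8 weight would cost 64 parameters per stitch. Changing one stitch's bias leaves the outputs of the other stitches untouched, which is the sense in which the bias terms are independent.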

Paper Structure (22 sections, 3 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: (a) Illustration of Stitchable Neural Network [pan2023stitchable]. With two anchors from the same model family, SN-Net connects the early layers of the smaller one to the later layers of the larger one through stitching layers, yielding a set of new networks with different performance-efficiency trade-offs, e.g., the path in blue. (b) Overview of our PST method tailored for fine-tuning a palette of stitches, which integrates stitch-agnostic LoRA modules with stitch-specific bias terms to promote diverse representations among stitches while keeping the number of trainable parameters low. (c) Overview of our task-specific stitch sampling. We estimate the importance scores of the stitches with a scoring function $Q(\cdot, \cdot)$ and accumulate them as global statistics with moving averages. For a resource constraint $\gamma$, we sample from a categorical distribution $\pi({\mathcal{N}}_{\gamma})$ parameterized by the normalized importance scores, so that important stitches receive higher sampling probabilities. After fine-tuning, we directly deploy the stitches with the highest scores to avoid the costly evaluation stage.
  • Figure 2: Distribution of pair-wise gradient angles among stitches when updating shared weights at fine-tuning iteration 600. The $90^\circ$ angle is marked with a dashed red line. For simplicity, we show the gradient angles for the combined query, key, and value projection matrices across a total of 32 stitches obtained by stitching ViT-Ti and ViT-S anchors. Generally, the gradient angles are larger in the target domain Stanford Cars [gebru2017cars] than in the source domain ImageNet-1k [russakovsky2015imagenet].
  • Figure 3: Performance comparisons with SN-Net [pan2023stitchable] for adapting ViT-Ti/S/B pre-trained on ImageNet-22k [deng2009imagenet] to Stanford Cars [gebru2017cars], CUB-200-2011 [wah2011caltech], Stanford Dogs [Khosla_FGVC2011dogs], and NABirds [van2015building]. Individually fine-tuned anchors are marked with yellow stars. We also show the number of trainable parameters.
  • Figure 4: Performance comparisons with SN-Net [pan2023stitchable] for adapting ViT-Ti/S/B pre-trained on ImageNet-22k [deng2009imagenet] to VTAB-1k [zhai2019vtab] and CIFAR-100 [Krizhevsky09learningmultiple]. Individually fine-tuned anchors are marked with yellow stars. We also show the number of trainable parameters.
  • Figure 5: Instruction-following comparison between the Stitched LLaMA obtained by our ESTA and Alpaca-LoRA 7B fine-tuned with LoRA [hu2022lora].
  • ...and 10 more figures
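
The sampling procedure described in the Figure 1(c) caption (score each stitch, accumulate scores with moving averages, sample from a categorical distribution over normalized scores, then deploy the highest-scoring stitches) can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: the class `StitchSampler`, the momentum value, the initial score of 1.0, and the function `snip_like_score` are hypothetical. In particular, `snip_like_score` uses the SNIP-style saliency $\sum_j |w_j \cdot g_j|$ as a stand-in for $Q(\cdot, \cdot)$; the exact scoring function used by ESTA may differ.

```python
import random


def snip_like_score(weights, grads):
    """SNIP-style saliency: sum of |weight * gradient| (assumed stand-in for Q)."""
    return sum(abs(w * g) for w, g in zip(weights, grads))


class StitchSampler:
    """Keeps a moving-average importance score per stitch and samples
    stitches in proportion to those scores (hypothetical helper)."""

    def __init__(self, stitch_ids, momentum=0.9, seed=0):
        # Start every stitch at the same positive score (an assumption).
        self.scores = {s: 1.0 for s in stitch_ids}
        self.momentum = momentum
        self.rng = random.Random(seed)

    def update(self, stitch_id, new_score):
        """Exponential moving average of the stitch's importance score."""
        m = self.momentum
        self.scores[stitch_id] = m * self.scores[stitch_id] + (1 - m) * new_score

    def probabilities(self, candidates):
        """Normalize scores over the stitches that fit one resource constraint."""
        total = sum(self.scores[s] for s in candidates)
        return {s: self.scores[s] / total for s in candidates}

    def sample(self, candidates):
        """Draw one stitch from the categorical distribution over candidates."""
        probs = self.probabilities(candidates)
        ids = list(probs)
        return self.rng.choices(ids, weights=[probs[s] for s in ids], k=1)[0]

    def best(self, candidates):
        """Pick the stitch to deploy: the highest accumulated score."""
        return max(candidates, key=lambda s: self.scores[s])


# Stitches "a", "b", "c" are assumed to satisfy the same resource constraint.
sampler = StitchSampler(["a", "b", "c"])
sampler.update("b", snip_like_score([1.0, -2.0], [3.0, 4.0]))  # score 11.0
print(sampler.probabilities(["a", "b", "c"]))
print(sampler.sample(["a", "b", "c"]), sampler.best(["a", "b", "c"]))
```

Because deployment reuses the scores accumulated during fine-tuning, no separate pass over a validation set is needed to rank the stitches, which is what makes the pipeline one-stage. The sketch assumes scores stay positive so that they can be used directly as sampling weights.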