Swiss Army Knife: Synergizing Biases in Knowledge from Vision Foundation Models for Multi-Task Learning

Yuxiang Lu, Shengcao Cao, Yu-Xiong Wang

TL;DR

This work tackles the uneven transfer of knowledge from diverse Vision Foundation Models by revealing their task-specific biases and proposing a bias-preserving distillation framework. The Swiss Army Knife (SAK) integrates a Teacher-Agnostic Stem with per-teacher Adapter Paths and a Mixture-of-Representations Router to dynamically fuse representations for multiple tasks, trained in two stages on ImageNet and downstream data. Empirically, SAK achieves state-of-the-art multi-task gains on PASCAL-Context and NYUD-v2, notably surpassing prior multi-teacher distillation approaches while maintaining efficiency, with robust ablations supporting the importance of bias preservation and adaptive fusion. This approach offers a scalable pathway to harness multiple VFMs for coordinated, robust multi-task vision, enabling easier extension to new teachers and tasks while reducing inference overhead.

Abstract

Vision Foundation Models (VFMs) have demonstrated outstanding performance on numerous downstream tasks. However, due to their inherent representation biases originating from different training paradigms, VFMs exhibit advantages and disadvantages across distinct vision tasks. Although amalgamating the strengths of multiple VFMs for downstream tasks is an intuitive strategy, effectively exploiting these biases remains a significant challenge. In this paper, we propose a novel and versatile "Swiss Army Knife" (SAK) solution, which adaptively distills knowledge from a committee of VFMs to enhance multi-task learning. Unlike existing methods that use a single backbone for knowledge transfer, our approach preserves the unique representation bias of each teacher by pairing lightweight Teacher-Specific Adapter Path modules with a Teacher-Agnostic Stem. Through dynamic selection and combination of representations with Mixture-of-Representations Routers, SAK synergizes the complementary strengths of multiple VFMs. Extensive experiments show that SAK outperforms the prior state of the art in multi-task learning by 10% on the NYUD-v2 benchmark, while also providing a flexible and robust framework that can readily accommodate more advanced model designs. Project page: https://innovator-zero.github.io/SAK/
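
The "dynamic selection and combination of representations" described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the softmax gating, the function names `softmax` and `mix_representations`, and the use of plain Python lists as feature vectors are all assumptions made for illustration. In the actual method the gate scores would presumably come from a learned, task-specific router; here they are simply passed in.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_representations(representations, gate_logits):
    """Fuse several same-length feature vectors into one.

    representations: one feature vector per source (hypothetically the
        shared stem plus one adapter path per teacher).
    gate_logits: one unnormalized score per source (assumed to be given;
        a learned router would normally produce these).
    Returns the fused vector and the normalized mixing weights.
    """
    if len(representations) != len(gate_logits):
        raise ValueError("need exactly one gate logit per representation")
    weights = softmax(gate_logits)
    dim = len(representations[0])
    mixed = [0.0] * dim
    for weight, rep in zip(weights, representations):
        if len(rep) != dim:
            raise ValueError("all representations must share one dimension")
        for i, value in enumerate(rep):
            mixed[i] += weight * value
    return mixed, weights


# Hypothetical 2-d features from a shared stem and two adapter paths.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = mix_representations(features, [0.0, 0.0, 0.0])
```

With equal gate scores the result is the plain average of the inputs; raising one score shifts the fused feature toward that source, which is the sense in which a per-task router can favor the teacher whose bias suits the task.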

Paper Structure

This paper contains 38 sections, 7 equations, 11 figures, 29 tables.

Figures (11)

  • Figure 1: (Left) Quantitative analysis of representation biases in Vision Foundation Models (VFMs), including DINOv2, CLIP, and SAM, on the PASCAL-Context dataset across five vision tasks, all using ViT-B backbones with pretrained parameters frozen. VFMs exhibit advantages and disadvantages across different downstream tasks when compared to a conventional ImageNet-pretrained backbone. Our SAK model, distilled from these VFM teachers, achieves the best average performance with more balanced improvements, as indicated by its larger ratio of mean improvement to standard deviation ($\mu/\sigma$). (Right) Qualitative comparison of representation biases through representative examples from semantic segmentation and boundary detection tasks. DINOv2 captures localized features but occasionally confuses semantic categories; CLIP excels in object-level understanding but lacks fine pixel-level details; SAM produces precise masks in both tasks due to higher input resolution but struggles with semantic knowledge. Our SAK successfully combines the precise boundary detection of SAM with the accurate semantic understanding of DINOv2 and CLIP. Further details are discussed in Section \ref{sec2}.
  • Figure 2: Overview of our proposed SAK framework, which distills foundational knowledge from a committee of frozen VFM teachers into an efficient student model. The student model operates like a Swiss Army Knife, with the Teacher-Agnostic Stem (TAS) serving as the main branch to learn universal knowledge among teachers. Each Teacher-Specific Adapter Path (TSAP) acts as a specialized tool to preserve the inherent representation bias of each teacher. Task-specific Mixture-of-Representations (MoR) Routers are then employed to synergize the complementary strengths of the teachers' biases, adaptively combining multi-level representations from both TAS and TSAP to generate tailored features for each task.
  • Figure 3: Performance comparison on two datasets, based on ViT-B backbones. The MTL Gain $\Delta_m$ for each of the two datasets is shown in the legend.
  • Figure 4: Weights of different experts learned by MoR Routers.
  • Figure 5: Performance w.r.t. downstream data percentage. MTL Gain is computed w.r.t. the single-task baseline on the full dataset. SAK is the most robust in downstream tasks.
  • ...and 6 more figures
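
The captions for Figures 3 and 5 report an MTL Gain $\Delta_m$ relative to single-task baselines. The sketch below uses the definition that is common in the multi-task learning literature, the average per-task relative change with the sign flipped for metrics where lower is better; that this paper uses exactly this form is an assumption, and the function name `mtl_gain`, the task names, and all numbers are hypothetical.

```python
def mtl_gain(multi_task, single_task, lower_is_better):
    """Average relative improvement over single-task baselines, in percent.

    multi_task, single_task: dicts mapping task name -> metric value.
    lower_is_better: dict mapping task name -> bool (True for error-style
        metrics such as depth RMSE, False for accuracy-style metrics).
    """
    total = 0.0
    for task, baseline in single_task.items():
        # Flip the sign so that an improvement always counts as positive.
        sign = -1.0 if lower_is_better[task] else 1.0
        total += sign * (multi_task[task] - baseline) / baseline
    return 100.0 * total / len(single_task)


# Hypothetical metrics, not results from the paper.
single = {"semseg_miou": 50.0, "depth_rmse": 0.50}
multi = {"semseg_miou": 55.0, "depth_rmse": 0.45}
direction = {"semseg_miou": False, "depth_rmse": True}
gain = mtl_gain(multi, single, direction)  # roughly 10 (percent)
```

Because every task contributes a relative change, tasks with very different metric scales (mIoU in percent, RMSE in meters) are weighted equally in the average.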