TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

Yiwei Guo, Shaobin Zhuang, Kunchang Li, Yu Qiao, Yali Wang

TL;DR

TransAgent is a general and concise framework that transports the knowledge of isolated agents in a unified manner, guides CLIP to generalize through multi-source knowledge distillation, and achieves state-of-the-art performance on 11 visual recognition datasets.

Abstract

Vision-language foundation models (such as CLIP) have recently shown their power in transfer learning, owing to large-scale image-text pre-training. However, target domain data in downstream tasks can differ substantially from the pre-training data, which makes it hard for such a single model to generalize well. Alternatively, there exists a wide range of expert models that contain diversified vision and/or language knowledge pre-trained on different modalities, tasks, networks, and datasets. Unfortunately, these models are "isolated agents" with heterogeneous structures, and how to integrate their knowledge for generalizing CLIP-like models has not been fully explored. To bridge this gap, we propose a general and concise TransAgent framework, which transports the knowledge of the isolated agents in a unified manner and effectively guides CLIP to generalize with multi-source knowledge distillation. With such a distinct framework, we flexibly collaborate with 11 heterogeneous agents to empower vision-language foundation models, without further cost in the inference phase. Finally, our TransAgent achieves state-of-the-art performance on 11 visual recognition datasets. Under the same low-shot setting, it outperforms the popular CoOp by around 10% on average, and by 20% on EuroSAT, which contains large domain shifts.

Paper Structure

This paper contains 19 sections, 9 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: An overview of our TransAgent. (a) TransAgent transfers multi-source knowledge from heterogeneous agents to enhance the generalization ability of vision-language foundation models. It demonstrates knowledge versatility, transfer flexibility and deployment efficiency through elaborate agent collaboration and knowledge ensemble strategy. (b) SOTA comparison for base-to-novel generalization on 11 visual recognition benchmarks. Our method outperforms previous SOTA, especially on the more diversified target domains.
  • Figure 2: Vision Agent Collaboration and Language Agent Collaboration. (a) VAC integrates visual knowledge via MoA gating and transfers the knowledge through layer-wise feature distillation. (b) LAC enhances the textual representations through class-specific feature distillation between the prompted textual feature and the gated textual feature.
  • Figure 3: Multi-modal Agent Collaboration. Top left: We first extract the cross attention maps from the T2I agents and then obtain the score vectors through LSE pooling. Top right: We compute the score vectors from the I2T agents as the cosine similarity between the projected visual feature and the LLM's textual feature. Finally, we perform score distillation between the learned score vectors and the gated score vectors to further align the learnable prompts.
  • Figure 4: Accuracy comparison in few-shot classification. TransAgent demonstrates state-of-the-art performance for all few-shot settings on different datasets, which proves promising learning capability even under extremely limited supervision.
  • Figure 5: Averaged gating weights of each agent on different datasets. A deeper color indicates a larger contribution to the gated feature(s) or score vectors.
  • ...and 1 more figure
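
The captions for Figures 2 and 3 mention three operations: LSE pooling of attention maps into score vectors, gating across agents, and score distillation. The sketch below illustrates one plausible reading of these steps in plain Python. It is not the authors' implementation: the function names, the softmax form of the gating weights, and the KL-divergence form of the distillation loss are all assumptions made for illustration, since the captions do not specify them.

```python
import math

def lse_pool(values):
    """Log-sum-exp pooling: a smooth approximation of max over the values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_softmax(xs):
    """Numerically stable log-softmax built on lse_pool."""
    lse = lse_pool(xs)
    return [x - lse for x in xs]

def gate(agent_scores, gate_logits):
    """Combine per-agent score vectors with softmax gating weights.

    Assumption: the gate is a softmax over one logit per agent.
    """
    weights = [math.exp(lw) for lw in log_softmax(gate_logits)]
    num_classes = len(agent_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, agent_scores))
        for c in range(num_classes)
    ]

def kl_distill(student_scores, teacher_scores):
    """KL(teacher || student) between softmax-normalized score vectors.

    Assumption: score distillation is modeled here as a KL divergence.
    """
    log_p = log_softmax(teacher_scores)
    log_q = log_softmax(student_scores)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Hypothetical data: two agents, each with one flattened attention map per class.
agent_maps = [
    [[0.1, 0.9, 0.2], [0.3, 0.2, 0.1]],  # agent 0: maps for class 0 and class 1
    [[0.5, 0.4, 0.6], [0.2, 0.8, 0.1]],  # agent 1
]
agent_scores = [[lse_pool(m) for m in maps] for maps in agent_maps]
gated = gate(agent_scores, gate_logits=[0.0, 0.0])
loss = kl_distill(student_scores=[0.4, 0.6], teacher_scores=gated)
```

With equal gate logits the gated vector reduces to the plain average of the agents' score vectors, and the distillation loss is zero when the student scores match the gated scores.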