Vision-Language Model Selection and Reuse for Downstream Adaptation

Hao-Zhe Tan, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

TL;DR

The paper tackles the problem of selecting and reusing pre-trained Vision-Language Models (VLMs) for downstream tasks, proposing Model Label Learning (MLL) as a three-part framework: model labeling, model selection, and model reuse. It builds a semantic graph from WordNet synsets to label VLMs via pre-testing, uses GPT-4-generated captions and caption embeddings to align task descriptions with graph concepts, and ensembles top-$k$ models per class with entropy-based weighting to produce robust zero-shot predictions. A comprehensive benchmark of 49 VLMs and 17 downstream datasets demonstrates that MLL achieves strong zero-shot performance, scalability as the model hub grows, and robust, per-class model selection coupled with ensemble predictions. The method emphasizes efficiency (offline labeling), scalability (growing semantic graph and hub), and practical impact for deploying VLMs in varied real-world tasks.

Abstract

Pre-trained Vision-Language Models (VLMs) are becoming increasingly popular across various visual tasks, and several open-sourced VLM variants have been released. However, selecting the best-performing pre-trained VLM for a specific downstream task is challenging, since no single VLM achieves promising performance on all downstream tasks, and evaluating all available VLMs is impossible due to time and data limitations. To address this problem, this paper proposes a novel paradigm for selecting and reusing VLMs for downstream tasks, called Model Label Learning (MLL). The proposal contains three key modules: model labeling, which assigns labels to each VLM to describe its specialty and utility; model selection, which matches the requirements of the target task with model labels; and model reuse, which applies the selected VLMs to the target task in an ensemble manner. The proposal is highly computationally efficient and growable, since the model labeling process is completed independently of the target task and its capability can grow with the number of candidate VLMs. We also introduce a new benchmark for evaluating VLM selection methods, comprising 49 VLMs and 17 target task datasets. Experimental results clearly demonstrate the effectiveness of the proposed method for selecting and reusing VLMs.
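
To make the model reuse step more concrete, the following is a minimal sketch of an entropy-weighted ensemble for a single image. The summary above states only that selected models are ensembled with entropy-based weighting; the specific weighting used here (each model weighted by `exp(-entropy)`, normalised across models) is an assumption for illustration and may differ from the paper's formulation. The function names and the example probabilities are hypothetical.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_ensemble(model_probs: List[List[float]]) -> List[float]:
    """Combine per-model class probabilities for one image.

    Assumption (not taken from the paper): each model's weight is
    exp(-entropy), normalised over models, so that more confident
    (lower-entropy) predictions contribute more to the result.
    """
    if not model_probs:
        raise ValueError("need at least one model prediction")
    raw = [math.exp(-entropy(p)) for p in model_probs]
    total = sum(raw)
    weights = [w / total for w in raw]
    n_classes = len(model_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, model_probs))
        for c in range(n_classes)
    ]


# Hypothetical zero-shot predictions of two selected models over three classes.
confident = [0.90, 0.05, 0.05]
uncertain = [0.30, 0.40, 0.30]

combined = entropy_weighted_ensemble([confident, uncertain])
predicted_class = max(range(len(combined)), key=combined.__getitem__)
```

Because the weights sum to one, the combined vector is still a probability distribution. In this example the confident model dominates, so the ensemble assigns more mass to the first class than a plain average of the two predictions would.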

Paper Structure

This paper contains 28 sections, 10 equations, 3 figures, 6 tables, 1 algorithm.

Figures (3)

  • Figure 1: The spider charts measure 49 models' capabilities across 7 downstream tasks and across classes within a task, showing that the best-performing models vary across downstream tasks and classes, which highlights the importance of model selection for VLMs. The 49 evaluated models are the same as those in the model hub, as discussed in the paper's model hub section.
  • Figure 2: The framework of the MLL paradigm. Models added to the hub first undergo a pre-testing phase in the labeling module, during which they are assigned labels that describe their specific functionalities. When a downstream task is presented, the selection module picks the relevant models, which are then ensembled to address the task.
  • Figure 3: Average performance on 17 downstream tasks with the scaling of the model hub.