ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts

Yuanchen Wu, Junlong Du, Ke Yan, Shouhong Ding, Xiaoqiang Li

TL;DR

The paper introduces ToVE, a framework for efficient vision-language learning that transfers knowledge from a hub of pre-trained vision experts to the vision tokens of a frozen CLIP encoder. A token-aware gating network dynamically routes expert knowledge to each vision token, and a residual knowledge transfer strategy preserves the generalizability of the CLIP tokens while enhancing perception. Low-contributing experts can be detached to improve inference efficiency, and a knowledge-merging step allows deployment of a single, knowledge-enriched CLIP encoder that needs no expert inference. With a unified pretraining objective and auxiliary losses that balance expert routing, ToVE achieves competitive VL performance using two orders of magnitude less training data, and it excels in zero-shot captioning and visual spatial reasoning. The approach is also compatible with LVLM setups (ToVE_Vicuna), and ablations and gating visualizations support it as a scalable alternative to large, data-hungry models.

Abstract

Vision-language (VL) learning requires extensive visual perception capabilities, such as fine-grained object recognition and spatial perception. Recent works typically rely on training huge models on massive datasets to develop these capabilities. As a more efficient alternative, this paper proposes a new framework that Transfers the knowledge from a hub of Vision Experts (ToVE) for efficient VL learning, leveraging pre-trained vision expert models to promote visual perception capability. Specifically, building on a frozen CLIP encoder that provides vision tokens for image-conditioned language generation, ToVE introduces a hub of multiple vision experts and a token-aware gating network that dynamically routes expert knowledge to vision tokens. In the transfer phase, we propose a "residual knowledge transfer" strategy, which not only preserves the generalizability of the vision tokens but also allows low-contributing experts to be detached to improve inference efficiency. Further, we explore merging this expert knowledge into a single CLIP encoder, creating a knowledge-merged CLIP that produces more informative vision tokens without expert inference during deployment. Experimental results across various VL tasks demonstrate that the proposed ToVE achieves competitive performance with two orders of magnitude less training data.
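
The gating and residual transfer steps described in the abstract can be illustrated with a minimal sketch for a single vision token. This is not the authors' implementation: the function name `residual_transfer`, the `active` parameter, the softmax gate, and the choice not to renormalize weights after detaching an expert are all assumptions made for illustration. Expert tokens are assumed to be already projected to the CLIP token dimension.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def residual_transfer(clip_token, expert_tokens, gate_logits, active=None):
    """Add gated expert knowledge to one CLIP vision token (hypothetical helper).

    clip_token:    list[float], token from the frozen CLIP encoder
    expert_tokens: list[list[float]], one token per expert, same dimension
    gate_logits:   list[float], one gating score per expert for this token
    active:        optional set of expert indices to keep; the rest are detached
    """
    weights = softmax(gate_logits)
    if active is not None:
        # Assumption: a detached expert simply contributes nothing;
        # the remaining weights are not renormalized.
        weights = [w if i in active else 0.0 for i, w in enumerate(weights)]
    # Residual form: start from the CLIP token and add the routed knowledge.
    enhanced = list(clip_token)
    for w, tok in zip(weights, expert_tokens):
        for d in range(len(enhanced)):
            enhanced[d] += w * tok[d]
    return enhanced, weights


# Two toy experts with equal gating scores.
enhanced, weights = residual_transfer(
    [1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
)
print(enhanced, weights)  # [1.5, 2.5] [0.5, 0.5]
```

Because the expert contribution is added on top of the CLIP token, calling the function with `active=set()` returns the original CLIP token unchanged. Under these assumptions, this is why an expert can be dropped at inference time without invalidating the remaining tokens.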

Paper Structure

This paper contains 26 sections, 12 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Comparison between Prismer-Z and the proposed ToVE on novel object captioning and visual spatial reasoning.
  • Figure 2: Different vision experts can provide rich visual prior knowledge, which can be transferred to VL learning, and efficiently improve visual perception capability with limited, small-scale data.
  • Figure 3: The overall framework of ToVE. The vision tokens processed by the vision encoder $\boldsymbol{\mathrm{E}}_\text{vis}$ are assigned expert knowledge through the gating network, then enhanced with a “residual knowledge transfer” strategy before interacting with the language model. The gating network dynamically assigns the optimal expert knowledge to each vision token for VL learning.
  • Figure 4: The overview of expert knowledge merging. The CLIP vision encoder enables the merging of expert knowledge into itself by predicting the knowledge-transferred vision tokens as an auxiliary learning target.
  • Figure 5: Impact of iteratively removing vision experts on zero-shot captioning performance.
  • ...and 5 more figures
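
The Figure 4 caption describes knowledge merging as having the CLIP vision encoder predict the knowledge-transferred vision tokens as an auxiliary learning target. A minimal sketch of such an auxiliary objective follows. The mean squared error and the name `merging_loss` are assumptions, since the caption does not specify the loss.

```python
def merging_loss(predicted_tokens, target_tokens):
    """Mean squared error between token sequences (hypothetical auxiliary loss).

    predicted_tokens: list[list[float]], tokens predicted by the CLIP encoder
    target_tokens:    list[list[float]], knowledge-transferred tokens used as
                      the regression target
    """
    total = 0.0
    count = 0
    for pred, tgt in zip(predicted_tokens, target_tokens):
        for p, t in zip(pred, tgt):
            total += (p - t) ** 2
            count += 1
    return total / count


# The loss is zero when the encoder already reproduces the target tokens.
print(merging_loss([[1.0, 2.0]], [[1.0, 2.0]]))  # 0.0
```

Once the encoder can reproduce the enhanced tokens on its own, the experts are no longer needed at deployment, which matches the stated goal of a knowledge-merged CLIP.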