MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

Minghao Zhu, Zhengpu Wang, Mengxian Hu, Ronghao Dang, Xiao Lin, Xun Zhou, Chengju Liu, Qijun Chen

TL;DR

This work tackles the generalization–specialization trade-off that arises when adapting vision–language foundation models to video by introducing MoTE, a mixture of temporal experts. MoTE employs multiple temporal FFN-style experts per Transformer layer, a multinomial routing policy to diversify learned data biases, and a weight merging regularization to preserve generalized knowledge while enabling specialization; it also adds temporal feature modulation to adapt test-time contributions semantically. Through extensive ablations and evaluations on Kinetics-400/600, UCF-101, HMDB-51, and SSv2, MoTE achieves state-of-the-art or competitive zero-shot and close-set results, including strong few-shot performance, while maintaining computational efficiency. The approach demonstrates that a unified model can reconcile generalization and specialization for video recognition, offering practical benefits for open-vocabulary and few-shot settings and providing insights into parameter-efficient transfer learning for vision–language models.

Abstract

Transferring visual-language knowledge from large-scale foundation models to video recognition has proved effective. To bridge the domain gap, additional parametric modules are added to capture temporal information. However, zero-shot generalization diminishes as the number of specialized parameters increases, forcing existing methods to trade zero-shot performance against close-set performance. In this paper, we present MoTE, a novel framework that balances generalization and specialization in one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with various degrees of data fitting. To maximally preserve the knowledge of each expert, we propose Weight Merging Regularization, which regularizes the merging process of experts in weight space. We additionally apply temporal feature modulation to regularize the contribution of temporal features at test time. We achieve a sound balance between zero-shot and close-set video recognition tasks and obtain state-of-the-art or competitive results on various datasets, including Kinetics-400 & 600, UCF, and HMDB. Code is available at https://github.com/ZMHH-H/MoTE.
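
As a rough illustration of what merging experts in weight space can look like, the sketch below collapses several expert weight vectors into one by a temperature-controlled weighted average. It is not taken from the MoTE code base: the function names, the flattened-list representation of weights, and the choice of a softmax over expert indices as the merging coefficients are all assumptions made for this example. Only the idea of sampling a temperature from a discrete set and using it to merge the experts comes from the paper's description.

```python
import math
import random


def merge_coefficients(num_experts, tau):
    """Return one merging coefficient per expert.

    Assumption for illustration: coefficients are a softmax over the
    expert indices scaled by the temperature tau. tau = 0 gives a
    uniform average; larger tau favors higher-indexed experts.
    """
    logits = [tau * i for i in range(num_experts)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_experts(expert_weights, tau):
    """Collapse flattened expert weight vectors into one merged vector."""
    coeffs = merge_coefficients(len(expert_weights), tau)
    size = len(expert_weights[0])
    return [
        sum(c * w[j] for c, w in zip(coeffs, expert_weights))
        for j in range(size)
    ]


def sample_temperature(candidates, rng):
    """Pick a temperature from a discrete candidate set (hypothetical values)."""
    return rng.choice(candidates)


if __name__ == "__main__":
    rng = random.Random(0)
    experts = [[0.0, 2.0], [2.0, 4.0]]  # two toy experts, two weights each
    tau = sample_temperature([0.0, 0.5, 1.0], rng)
    print(tau, merge_experts(experts, tau))
```

With `tau = 0.0` the two toy experts are averaged uniformly, so the merged vector is `[1.0, 3.0]`. Sampling the temperature anew at each training step exposes the model to a range of merged configurations, which is one plausible reading of how the merging process can be regularized.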

Paper Structure

This paper contains 55 sections, 11 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview of existing VLM knowledge transfer methods. (a) Trade-off plots between zero-shot (Harmonic mean of UCF, HMDB, and K600) and close-set (K400) performance of recent CLIP-based methods (ViT-B/16). (b) As the number of temporal layers increases, the generalization of the standard Transformer layer severely degrades while our proposed MoTE consistently improves the zero-shot and close-set performance. (c) Our proposed MoTE seeks to construct a reconciled feature space between the optimal generalized and specialized manifolds.
  • Figure 2: An overview of the MoTE framework. (Left): We independently extract the feature of each frame with the CLIP visual encoder. Then, the frame token sequences from a given batch are routed to an activated expert for temporal pattern encoding. To regularize the merging process, we sample the temperature $\tau$ from a discrete set and use it to collapse multi-experts into one merged FFN. (Right): Temporal feature modulation. We modulate the contribution of the temporal feature with the semantic association, which is measured by the similarity between the proxy text features retrieved from the fine-tuning and the test categories. The modulated embedding is used for inference.
  • Figure 3: Expert-wise performance of MoTE. CLIP-Mean denotes a fine-tuned CLIP model with mean pooling for temporal modeling.
  • Figure 4: Visualization of the Top-1 accuracy for each video category sampled from UCF-101 with respect to the merged expert and each individual expert.
  • Figure 5: Illustration of optional architecture designs for the temporal expert. We omit the activation function between the projection matrices for brevity.
  • ...and 2 more figures
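
The temporal feature modulation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the aggregation rule (mean of each test category's best cosine similarity to any fine-tuning category), the clipping to [0, 1], the additive combination of frame-level and temporal features, and all function names are hypothetical. The caption itself states only that the temporal feature's contribution is modulated by the similarity between text features of the fine-tuning and test categories.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def modulation_weight(test_text_feats, train_text_feats):
    """Estimate how semantically close the test categories are to the
    fine-tuning categories.

    Assumption: for every test category take its best match among the
    fine-tuning categories, average those scores, and clip to [0, 1].
    """
    best = [
        max(cosine(t, f) for f in train_text_feats)
        for t in test_text_feats
    ]
    weight = sum(best) / len(best)
    return min(1.0, max(0.0, weight))


def modulate(frame_mean, temporal_feat, weight):
    """Add the temporal feature to the mean frame feature, scaled by weight."""
    return [m + weight * t for m, t in zip(frame_mean, temporal_feat)]


if __name__ == "__main__":
    train = [[1.0, 0.0], [0.0, 1.0]]  # toy text features of fine-tuning classes
    test = [[1.0, 0.0]]               # toy text feature of a test class
    w = modulation_weight(test, train)
    print(w, modulate([1.0, 1.0], [2.0, 4.0], w))
```

Under this sketch, test categories that closely match the fine-tuning categories keep the full temporal feature, while unrelated categories fall back toward the mean frame feature, which stays closer to the original CLIP representation.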