MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer
Minghao Zhu, Zhengpu Wang, Mengxian Hu, Ronghao Dang, Xiao Lin, Xun Zhou, Chengju Liu, Qijun Chen
TL;DR
This work tackles the generalization–specialization trade-off that arises when adapting vision–language foundation models to video by introducing MoTE, a mixture of temporal experts. MoTE employs multiple temporal FFN-style experts per Transformer layer, a multinomial routing policy to diversify the data biases the experts learn, and a weight merging regularization that preserves generalized knowledge while enabling specialization; it also adds temporal feature modulation, which adjusts the contribution of temporal features at test time based on semantics. In extensive ablations and evaluations on Kinetics-400/600, UCF-101, HMDB-51, and SSv2, MoTE achieves state-of-the-art or competitive zero-shot and close-set results, including strong few-shot performance, while maintaining computational efficiency. The approach demonstrates that a unified model can reconcile generalization and specialization for video recognition, offering practical benefits for open-vocabulary and few-shot settings and providing insights into parameter-efficient transfer learning for vision–language models.
Abstract
Transferring visual-language knowledge from large-scale foundation models to video recognition has proved effective. To bridge the domain gap, additional parametric modules are added to capture temporal information. However, zero-shot generalization diminishes as the number of specialized parameters increases, forcing existing works to trade off zero-shot against close-set performance. In this paper, we present MoTE, a novel framework that enables generalization and specialization to be balanced in one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with various degrees of data fitting. To maximally preserve the knowledge of each expert, we propose \emph{Weight Merging Regularization}, which regularizes the merging process of experts in weight space. Additionally, we apply temporal feature modulation to regularize the contribution of temporal features at test time. We achieve a sound balance between zero-shot and close-set video recognition tasks and obtain state-of-the-art or competitive results on various datasets, including Kinetics-400 \& 600, UCF, and HMDB. Code is available at \url{https://github.com/ZMHH-H/MoTE}.
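
The idea of routing among several temporal experts during training and then merging them in weight space can be illustrated with a minimal sketch. This is not the released MoTE implementation: the function names (`make_expert`, `sample_expert`, `merge_experts`), the uniform averaging coefficients, and the routing probabilities are all hypothetical choices made for illustration, and the regularization term itself is not reproduced here.

```python
import random


def make_expert(d_in, d_hidden, rng):
    """Build one FFN-style expert as two weight matrices (nested lists).

    Hypothetical helper: the real experts are Transformer FFN modules.
    """
    return {
        "w1": [[rng.gauss(0, 0.02) for _ in range(d_hidden)] for _ in range(d_in)],
        "w2": [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_hidden)],
    }


def sample_expert(experts, probs, rng):
    """Training-time routing: draw one expert index from a multinomial
    distribution over experts (probabilities are an assumption)."""
    return rng.choices(range(len(experts)), weights=probs, k=1)[0]


def merge_experts(experts, coeffs=None):
    """Collapse all experts into a single expert by a weighted average
    of their parameters. Uniform coefficients are assumed by default."""
    n = len(experts)
    if coeffs is None:
        coeffs = [1.0 / n] * n
    merged = {}
    for name in experts[0]:
        rows = len(experts[0][name])
        cols = len(experts[0][name][0])
        merged[name] = [
            [sum(c * e[name][i][j] for c, e in zip(coeffs, experts))
             for j in range(cols)]
            for i in range(rows)
        ]
    return merged


rng = random.Random(0)
experts = [make_expert(4, 8, rng) for _ in range(3)]
active = sample_expert(experts, [0.2, 0.3, 0.5], rng)  # expert used for this step
merged = merge_experts(experts)                        # single expert for inference
```

Under these assumptions, only one expert is active per training step, while inference uses a single merged expert, so the test-time cost matches that of one FFN regardless of how many experts were trained.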
