Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning

Zhengyang Liang, Meiyu Liang, Wei Huang, Yawen Li, Zhe Xue

TL;DR

This work tackles the challenge of deploying pre-trained multimodal large models in resource-constrained settings by introducing a dynamic self-adaptive multiscale distillation framework. A BEiT-3 teacher guides a ViT-BERT student through four distillation losses (contrastive, feature, cosine similarity, and hard negative), combined with a dynamic loss balancer that adapts the loss weights during training. The method relies only on the teacher's output features and image-level information, and achieves state-of-the-art cross-modal retrieval performance on Flickr30k and MSCOCO without depending on region-level cues. Empirical results show robust performance gains, reduced model complexity, and faster inference, highlighting the method's practical potential for lightweight multimodal systems.

Abstract

In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose, for the first time, a dynamic self-adaptive multiscale distillation method that transfers knowledge from a pre-trained multimodal large model for efficient cross-modal representation learning. Unlike existing distillation methods, our strategy employs a multiscale perspective, enabling the extraction of structural knowledge from the pre-trained multimodal large model and ensuring that the student model inherits a comprehensive and nuanced understanding of the teacher's knowledge. To optimize each distillation loss in a balanced and efficient manner, we propose a dynamic self-adaptive distillation loss balancer, a novel component that eliminates the need for manual loss-weight adjustment and dynamically balances each loss term during the distillation process. Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources. This efficient approach is suited to various applications and allows the deployment of advanced multimodal technologies even in resource-limited settings. Extensive experiments demonstrate that our method maintains high performance while significantly reducing model complexity and training costs. Moreover, our distilled student model uses only image-level information to achieve state-of-the-art performance on cross-modal retrieval tasks, surpassing previous methods that relied on region-level information.
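
To make the idea of a loss balancer concrete, the sketch below shows one way several loss terms can be reweighted automatically at each step. It is not the balancer proposed in the paper, whose exact rule is not given in this summary. As a stand-in it uses a dynamic-weight-averaging style rule: a loss that has been shrinking more slowly than the others receives a larger weight. The function names `balance_weights` and `combine`, the `temperature` parameter, and the example loss values are all illustrative assumptions.

```python
import math


def balance_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Return one weight per loss term; the weights sum to the number of terms.

    Stand-in rule (an assumption, not the paper's balancer): take the ratio of
    each loss at the last step to its value one step earlier, then apply a
    softmax over those ratios. A ratio near or above 1 means the loss is
    falling slowly (or rising), so it receives more weight.
    """
    ratios = [p / max(pp, 1e-12) for p, pp in zip(prev_losses, prev_prev_losses)]
    scaled = [r / temperature for r in ratios]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    k = len(ratios)
    return [k * e / total for e in exps]


def combine(losses, weights):
    """Weighted sum of the individual loss values."""
    return sum(w * l for w, l in zip(weights, losses))


# Hypothetical values for four loss terms at two consecutive steps.
# Order: contrastive, feature, similarity, hard negative.
step_1 = [1.00, 1.00, 1.00, 1.00]
step_2 = [1.00, 0.50, 0.20, 0.90]
weights = balance_weights(step_2, step_1)
total_loss = combine(step_2, weights)
```

In this toy run the first term did not improve between the two steps, so it receives the largest weight, while the third term, which dropped fastest, receives the smallest. No weight has to be set by hand, which is the property the abstract attributes to the balancer.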

Paper Structure (27 sections, 13 equations, 2 figures, 10 tables)


Figures (2)

  • Figure 1: Our framework efficiently leverages pre-trained multimodal large models, utilizing only their output features and raw image-level data. It dynamically and adaptively accomplishes knowledge distillation across multiple scales—including contrastive distillation, feature distillation, similarity distillation, and hard negative sample distillation. This streamlined approach enables the student model to effectively master the intricate structural feature space of the teacher model.
  • Figure 2: 3D Visualizations of Feature Space after PCA
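
The four distillation terms named in the Figure 1 caption can be illustrated with a small, dependency-free sketch. The formulas here are common textbook choices picked for illustration, not the paper's definitions, which are not reproduced in this summary: mean squared error for feature distillation, matching of image-text cosine-similarity matrices for similarity distillation, InfoNCE for contrastive distillation, and a hinge loss on the negative the teacher ranks highest for hard negative distillation. All function names, the `temperature` and `margin` values, and the toy embeddings are assumptions. The sketch also assumes that student and teacher embeddings share one dimension and that each batch holds at least two image-text pairs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [a / norm for a in v]


def sim_matrix(images, texts):
    """Cosine similarity between every image and every text embedding."""
    imgs = [normalize(v) for v in images]
    txts = [normalize(v) for v in texts]
    return [[dot(i, t) for t in txts] for i in imgs]


def feature_loss(student, teacher):
    """Mean squared error between L2-normalised student and teacher embeddings."""
    total, count = 0.0, 0
    for s, t in zip(student, teacher):
        for a, b in zip(normalize(s), normalize(t)):
            total += (a - b) ** 2
            count += 1
    return total / count


def similarity_loss(s_sim, t_sim):
    """Mean squared error between student and teacher similarity matrices."""
    n = len(s_sim)
    return sum((s_sim[i][j] - t_sim[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


def contrastive_loss(sim, temperature=0.07):
    """InfoNCE in the image-to-text direction; matching pairs lie on the diagonal."""
    loss = 0.0
    for i, row in enumerate(sim):
        logits = [x / temperature for x in row]
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        loss += log_z - logits[i]
    return loss / len(sim)


def hard_negative_loss(s_sim, t_sim, margin=0.2):
    """Hinge loss on the non-matching text the teacher scores highest per image."""
    loss = 0.0
    for i in range(len(s_sim)):
        negatives = [j for j in range(len(s_sim)) if j != i]
        hardest = max(negatives, key=lambda j: t_sim[i][j])
        loss += max(0.0, margin + s_sim[i][hardest] - s_sim[i][i])
    return loss / len(s_sim)


# Toy batch of two image-text pairs with 2-dimensional embeddings.
teacher_img, teacher_txt = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
student_img, student_txt = [[0.9, 0.1], [0.2, 0.8]], [[1.0, 0.1], [0.1, 1.0]]
t_sim = sim_matrix(teacher_img, teacher_txt)
s_sim = sim_matrix(student_img, student_txt)
losses = [
    contrastive_loss(s_sim),
    feature_loss(student_img, teacher_img),
    similarity_loss(s_sim, t_sim),
    hard_negative_loss(s_sim, t_sim),
]
```

Every term needs only the teacher's output embeddings and whole-image features, which matches the input requirements stated in the caption. In a real training loop these would be tensor operations, and the student would usually need a projection layer when its embedding size differs from the teacher's.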