Adapting Multi-modal Large Language Model to Concept Drift From Pre-training Onwards

Xiaoyu Yang; Jie Lu; En Yu

TL;DR

This work addresses concept drift in multi-modal large language models by extending concept drift theory to multimodal data and introducing a T-distributed adapter (Thp) operating on a hyperspherical embedding space to mitigate tailed drift and enable OOD drift detection. The proposed T-distributed spherical metric supports drift-aware pre-training with image-text contrastive learning and enables drift-aware routing during fine-tuning via a mixture-of-experts with a KNN-based OOD detector. A new OpenMMlo dataset of approximately 740k image-caption pairs across long-tailed open-world categories is released to evaluate robustness and open-world generalization. Overall, the framework yields improved image-text alignment during pre-training and stronger downstream robustness to long-tail and OOD distributions, with public code and data to spur further research in multi-modal drift adaptation.
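The KNN-based OOD detection mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature bank, the value of `k`, the cosine-distance score, and the `threshold` are all assumptions chosen for illustration, following the general idea of flagging a sample as OOD when it lies far from its nearest in-distribution neighbours on the unit hypersphere.

```python
import math

def normalize(v):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def knn_ood_score(query, bank, k=1):
    """Cosine distance from `query` to its k-th nearest neighbour in `bank`.

    `bank` is a hypothetical list of in-distribution feature vectors.
    A larger score means the query is farther from known data.
    """
    if not 1 <= k <= len(bank):
        raise ValueError("k must be between 1 and len(bank)")
    q = normalize(query)
    distances = sorted(
        1.0 - sum(x * y for x, y in zip(q, normalize(b))) for b in bank
    )
    return distances[k - 1]

def is_ood(query, bank, k=1, threshold=0.5):
    """Flag the query as OOD when its k-NN distance exceeds `threshold`.

    The threshold is an assumed value; in practice it would be tuned
    on held-out in-distribution data.
    """
    return knn_ood_score(query, bank, k) > threshold

# Toy 2-d features standing in for real embeddings.
bank = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
near_query = [0.95, 0.05]   # close to the bank
far_query = [-1.0, 0.0]     # points away from every bank vector

near_flag = is_ood(near_query, bank)
far_flag = is_ood(far_query, bank)
```

In a routing setting, a score like this could decide whether a sample is sent to an existing expert or treated as a sudden drift; how the paper wires this into its mixture-of-experts is not specified in this summary.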

Abstract

Multi-modal Large Language Models (MLLMs) frequently face concept drift when dealing with real-world streaming data, in which distributions change unpredictably. This mainly includes gradual drift caused by long-tailed data and sudden drift caused by Out-Of-Distribution (OOD) data, both of which have drawn increasing attention from the research community. While these issues have been studied extensively in the vision and language domains individually, their impact on MLLMs in concept drift settings remains largely underexplored. In this paper, we reveal that Vision-Language (VL) models are vulnerable to significant biases arising from gradual and sudden drift, particularly during pre-training. To address these challenges, we propose a unified framework that extends concept drift theory to the multi-modal domain, enhancing the adaptability of the VL model to unpredictable distribution changes. Additionally, we propose a T-distribution based drift adapter that mitigates the bias induced by gradual drift and, through explicit distribution modeling, also helps the model distinguish sudden distribution changes. Extensive experiments demonstrate that our method enhances the efficiency and accuracy of image-text alignment in the pre-training of VL models, particularly in the concept drift scenario. Moreover, various downstream tasks show significant improvements in our model's ability to adapt to the long-tailed open world. Furthermore, we create a set of multi-modal datasets called OpenMMlo, specifically tailored to the long-tailed open-world setting, to validate our findings. To foster the development of the multi-modal community, we have made both the OpenMMlo datasets and our code publicly available at: https://github.com/XiaoyuYoung/ConceptDriftMLLMs.
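To give a feel for what a T-distribution based spherical similarity can look like, the sketch below implements the t-vMF similarity of Kobayashi (CVPR 2021), a published heavy-tailed alternative to plain cosine similarity on the hypersphere. Whether the paper's adapter uses exactly this form is an assumption; the function name and the example values of the concentration $\kappa$ are illustrative only.

```python
def t_vmf_similarity(cos_theta, kappa):
    """t-vMF similarity (Kobayashi, 2021), assumed here for illustration.

    phi(cos_theta; kappa) = (1 + cos_theta) / (1 + kappa * (1 - cos_theta)) - 1

    With kappa = 0 this reduces to plain cosine similarity. A larger kappa
    makes the similarity fall off faster as the angle grows, which
    encourages more compact classes.
    """
    if not -1.0 <= cos_theta <= 1.0:
        raise ValueError("cos_theta must lie in [-1, 1]")
    if kappa < 0:
        raise ValueError("kappa must be non-negative")
    return (1.0 + cos_theta) / (1.0 + kappa * (1.0 - cos_theta)) - 1.0

# Compare plain cosine (kappa = 0) with a sharper setting (kappa = 16)
# over a few cosine values.
cos_values = [1.0, 0.5, 0.0, -1.0]
plain = [t_vmf_similarity(c, 0.0) for c in cos_values]
sharp = [t_vmf_similarity(c, 16.0) for c in cos_values]
```

Both settings agree at the extremes (similarity 1 for identical directions, -1 for opposite ones); they differ in between, where the larger concentration penalizes moderate angular deviations more strongly.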

Paper Structure (24 sections, 23 equations, 4 figures, 7 tables)

Introduction
Methodology
Multi-modal Concept Drift Theory
T-distributed Adapter for Concept Drift
T-distributed Vision Language Model for the Concept Drift
Building Multi-modal Dataset OpenMMlo for the Long-Tailed Open World
Experiments
Taming the Tailed Drift and OOD Drift for Robust Fine-tuning
Concept Drift-Aware Image-Text Alignment for Effective Pre-training
Ablation Experiments
T-distributed Spherical Embedding in the Pre-training and Fine-tuning
Various Concentration $\kappa$ in T-Adapter
Conclusions And Outlook
Appendix
Related Works
...and 9 more sections

Figures (4)

Figure 1: The impacts of tailed drift and OOD drift on the vision-language model during pre-training and fine-tuning, respectively. (a) For pre-training, we visualize the alignment results of models pre-trained on a balanced dataset (denoted BL) and on an imbalanced dataset (denoted LT), both without OOD samples and both evaluated on the same balanced test set. The cosine metric measures the distance between unit image and text features across categories, including OOD samples, and is expressed in degrees. A smaller angle indicates higher similarity between the features. The plot thus provides a feature-level visualization of intra-class compactness and inter-class separability in the vision-language model. (b) For fine-tuning on imbalanced datasets, the mutual cosine distance between the category centers in the classifier is visualized directly to illustrate the classifier's feature space (blue bars). In addition, the average cosine distance between each category center and the OOD samples is shown as orange bars.
Figure 2: The workflow of our methodology, which consists of two stages: pre-training of the vision-language model and fine-tuning on downstream tasks. In both stages, a drift adaptation window slides over the data stream to detect changes in the data distribution and update the model accordingly. During pre-training, the T-distributed adapter aligns the visual and textual feature spaces through image-text contrastive learning with a large inter-class margin. This contrastive loss, coupled with the language model loss, drives the training of all modules. In the downstream task, the image encoder and the text decoder are frozen, and a linear projector fuses the image and text features. Additionally, a mixture-of-experts module uses the T-distributed adapter as its router, allowing it to adapt to tailed drift and to detect OOD drift based on the modeled distribution.
Figure 3: The proposed T-distributed spherical metric with various $\kappa$ and the classical vMF metric when $\kappa=1$.
Figure 4: Samples from OpenMMlo in the training set, test set, and open set.
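The angular measure described in the Figure 1 caption, a cosine metric between unit features expressed in degrees, can be sketched as follows. The feature vectors are hypothetical toy values; real image and text encoders produce much higher-dimensional embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def angle_degrees(a, b):
    """Angle between two feature vectors, in degrees, via cosine similarity."""
    ua, ub = normalize(a), normalize(b)
    cos_sim = sum(x * y for x, y in zip(ua, ub))
    cos_sim = max(-1.0, min(1.0, cos_sim))  # guard against rounding error
    return math.degrees(math.acos(cos_sim))

# Hypothetical 3-d features for one image and two captions.
image_feature = [1.0, 0.0, 0.0]
text_same_class = [1.0, 1.0, 0.0]
text_other_class = [0.0, 0.0, 1.0]

angle_same = angle_degrees(image_feature, text_same_class)    # 45 degrees
angle_other = angle_degrees(image_feature, text_other_class)  # 90 degrees
```

A smaller angle between an image and its matching caption than between the image and captions of other classes is the pattern that indicates good intra-class compactness and inter-class separability.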

Theorems & Definitions (4)

Remark 1.1
Definition 2.1
Remark A.1
proof
