DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data
Junjie Wu, Jiangtao Xie, Zhaolin Zhang, Qilong Wang, Qinghua Hu, Peihua Li, Sen Xu
TL;DR
This work addresses domain-specific multimodal learning for biology by replacing single-token CLIP embeddings with distribution-based representations of image-text pairs. It introduces DALIP, which aligns first- and second-order statistics via a PL-based distribution matching objective and uses a Multi-head Brownian Distance Covariance (MBDC) module to capture second-order relationships efficiently. The PlantMix-13M dataset (10M plant + 3M general-domain samples) is built to balance domain-specific and general-domain performance. Experiments in the plant, biological, remote sensing, and medical imaging domains show that DALIP outperforms state-of-the-art domain-specific CLIP models while preserving general-domain capability, and ablations confirm both the value of combining first- and second-order statistics and the effectiveness of MBDC. The work contributes a method for domain adaptation of multimodal models and a large dataset to support future research.
Abstract
Recently, Contrastive Language-Image Pre-training (CLIP) has shown promising performance on domain-specific data (e.g., biology) and has attracted increasing research attention. Existing works generally focus on collecting extensive domain-specific data and directly tuning the original CLIP models. Intuitively, such a paradigm does not fully consider the characteristics of domain-specific data (e.g., the fine-grained nature of biological data), which limits model capability, and it largely loses the original ability of CLIP in the general domain. In this paper, we propose a Distribution Alignment-based Language-Image Pre-Training (DALIP) method for biological data. Specifically, DALIP optimizes CLIP models by matching the similarity between the feature distributions of image-text pairs rather than between the original [cls] tokens. These distributions capture rich yet effective information inherent in image-text pairs and serve as powerful representations that better cope with the fine-grained nature of biological data. In particular, DALIP efficiently approximates each feature distribution by its first- and second-order statistics, and presents a Multi-head Brownian Distance Covariance (MBDC) module to acquire the second-order statistics of token features efficiently. Furthermore, we collect a new dataset for the plant domain (a specific part of the biological domain), namely PlantMix-13M, which comprises 10M plant data and 3M general-domain data mixed according to data mixing laws. Extensive experiments show that DALIP clearly outperforms existing CLIP counterparts in the biological domain while generalizing well to the remote sensing and medical imaging domains. In addition, our PlantMix-13M dataset further boosts the performance of DALIP in the plant domain while preserving model ability in the general domain.
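To make the idea of comparing feature distributions through first- and second-order statistics concrete, the following is a minimal NumPy sketch. It is not the DALIP objective or the MBDC module: a plain covariance matrix stands in for the second-order statistics, cosine similarity stands in for the matching function, and the function names and the weight `alpha` are hypothetical choices made only for illustration.

```python
import numpy as np


def distribution_stats(tokens):
    """Summarize a set of token features by its mean and covariance.

    tokens: array of shape (n_tokens, dim).
    Returns (mean of shape (dim,), covariance of shape (dim, dim)).
    """
    mu = tokens.mean(axis=0)                       # first-order statistic
    centered = tokens - mu
    cov = centered.T @ centered / tokens.shape[0]  # second-order statistic
    return mu, cov


def _cosine(a, b, eps=1e-8):
    """Cosine similarity between two arrays, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def distribution_similarity(img_tokens, txt_tokens, alpha=0.5):
    """Illustrative similarity between two token sets.

    Combines the similarity of the means and of the covariances.
    `alpha` is an assumed mixing weight, not a value from the paper.
    """
    mu_i, cov_i = distribution_stats(img_tokens)
    mu_t, cov_t = distribution_stats(txt_tokens)
    return alpha * _cosine(mu_i, mu_t) + (1.0 - alpha) * _cosine(cov_i, cov_t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_tokens = rng.normal(size=(49, 16))  # e.g. 49 patch tokens, dim 16
    text_tokens = rng.normal(size=(12, 16))   # e.g. 12 word tokens, dim 16
    print(distribution_similarity(image_tokens, text_tokens))
```

In a CLIP-style setup, a score of this kind would replace the dot product between single [cls] embeddings when building the image-text similarity matrix for the contrastive loss. Both token sets must share the same feature dimension, while the number of tokens may differ.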
