DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data

Junjie Wu, Jiangtao Xie, Zhaolin Zhang, Qilong Wang, Qinghua Hu, Peihua Li, Sen Xu

TL;DR

This work addresses the challenge of domain-specific multimodal learning for biology by replacing single-token CLIP embeddings with distribution-based representations of image-text pairs. It introduces DALIP, which aligns first- and second-order statistics via a PL-based distribution matching objective and employs a Multi-head Brownian Distance Covariance (MBDC) module to capture second-order relationships among token features efficiently. The PlantMix-13M dataset (10M plant + 3M general-domain samples) is built to balance domain-specific and general-domain performance. Empirical results across plant, biological, remote sensing, and medical imaging tasks show that DALIP outperforms state-of-the-art domain-specific CLIP models while preserving general-domain capabilities; ablations confirm the value of combining first- and second-order statistics and the effectiveness of MBDC. The work contributes both a scalable methodology for domain adaptation in multimodal models and a substantial dataset to support future research.

Abstract

Contrastive Language-Image Pre-training (CLIP) has recently shown promising performance on domain-specific data (e.g., biology) and has attracted increasing research attention. Existing works generally focus on collecting extensive domain-specific data and directly tuning the original CLIP models. Intuitively, such a paradigm does not fully account for the characteristics of domain-specific data (e.g., the fine-grained nature of biological data), which limits model capability, and it largely loses CLIP's original ability in the general domain. In this paper, we propose a Distribution Alignment-based Language-Image Pre-Training (DALIP) method for biological data. Specifically, DALIP optimizes CLIP models by matching the similarity between feature distributions of image-text pairs instead of the original [cls] tokens. These distributions capture rich yet effective information inherent in image-text pairs as powerful representations, and so cope better with the fine-grained nature of biological data. In particular, DALIP approximates each feature distribution by its first- and second-order statistics, and introduces a Multi-head Brownian Distance Covariance (MBDC) module to acquire second-order statistics of token features efficiently. Furthermore, we collect a new dataset for the plant domain (i.e., a specific sub-domain of the biological domain), named PlantMix-13M, comprising 10M plant samples and 3M general-domain samples mixed according to data mixing laws. Extensive experiments show that DALIP clearly outperforms existing CLIP counterparts in the biological domain, while generalizing well to the remote sensing and medical imaging domains. In addition, our PlantMix-13M dataset further boosts the performance of DALIP in the plant domain while preserving model ability in the general domain.
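To make the idea of a distribution-based representation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It summarizes a set of token features by its mean (first-order statistic) and by per-head Brownian distance covariance matrices (second-order statistic), then compares an image and a text with cosine similarity. The function names, the way channels are split into heads, the concatenation of mean and BDC entries, and the cosine similarity are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def bdc_matrix(tokens: np.ndarray) -> np.ndarray:
    """Brownian distance covariance matrix of token features.

    tokens has shape (n_tokens, d). Each channel is treated as one
    observation vector over the tokens; the result is the (d, d)
    double-centred Euclidean distance matrix between channels.
    """
    x = tokens.T                                   # (d, n_tokens)
    sq = np.sum(x * x, axis=1)                     # squared norm per channel
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    dist = np.sqrt(np.clip(d2, 0.0, None))         # clip tiny negatives from rounding
    row_mean = dist.mean(axis=1, keepdims=True)
    col_mean = dist.mean(axis=0, keepdims=True)
    return dist - row_mean - col_mean + dist.mean()


def distribution_embedding(tokens: np.ndarray, num_heads: int = 4) -> np.ndarray:
    """Hypothetical embedding: mean plus multi-head BDC, L2-normalised.

    Splitting the d channels into num_heads groups (an assumption here)
    shrinks the second-order part from d*d to d*d/num_heads entries.
    """
    _, d = tokens.shape
    if d % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    first_order = tokens.mean(axis=0)
    heads = np.split(tokens, num_heads, axis=1)
    second_order = [bdc_matrix(h).ravel() for h in heads]
    vec = np.concatenate([first_order] + second_order)
    return vec / (np.linalg.norm(vec) + 1e-12)


# Toy usage with random stand-ins for projected image and text tokens.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(49, 16))   # e.g. 49 patch tokens, 16 channels
text_tokens = rng.normal(size=(12, 16))    # e.g. 12 word tokens, 16 channels

img_vec = distribution_embedding(image_tokens)
txt_vec = distribution_embedding(text_tokens)
similarity = float(img_vec @ txt_vec)      # cosine similarity in [-1, 1]
```

In a contrastive setup, such pairwise similarities would take the place of the [cls]-token similarities in the CLIP logits. With 16 channels and 4 heads, the second-order part has 4 × 4 × 4 = 64 entries instead of the 256 a single 16 × 16 matrix would need, which illustrates why a multi-head split can reduce cost.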

Paper Structure

This paper contains 20 sections, 11 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: (a) Example images in plant domain. (b) Comparison of different CLIP models on general (ImageNet-1K) and specific (average on five plant tasks) domains, including OpenCLIP [radford2021learning], BioCLIP [stevens2024bioclip], ArborCLIP-O [yang2024arboretum] and our DALIP, where DALIP achieves promising results in both plant and general domains.
  • Figure 2: (a) Overview of our Distribution Alignment-based Language-Image Pre-Training (DALIP) method for biological data. Specifically, DALIP optimizes CLIP models by matching the similarity between feature distribution of image-text pairs, which are efficiently approximated by first- and second-order statistics of token features. Particularly, (b) a Multi-head Brownian Distance Covariance (MBDC) module is presented to efficiently acquire second-order statistics of token features, whose details can be found in Sec. \ref{sec:MDBC}.
  • Figure 3: Example illustration of generating detailed plant descriptions by prompting Qwen2-VL-7B [Qwen2VL], where Latin and Common names, images, and tailored instruction prompts are used as inputs.
  • Figure 4: Results of various data mixing ratios in general (i.e., IN-1K) and plant domains (average on five tasks in Table \ref{tab:Plantmix}).
  • Figure S1: Convergence speed of DALIP and OpenCLIP when tuned on TOL-1M, where accuracies on Fungi are reported. For brevity, we show results for the first 40 training epochs.
  • ...and 2 more figures