CLIP-like Model as a Foundational Density Ratio Estimator

Fumiya Uchiyama, Rintaro Yanagi, Shohei Taniguchi, Shota Takashiro, Masahiro Suzuki, Hirokatsu Kataoka, Yusuke Iwasawa, Yutaka Matsuo

TL;DR

This work reframes CLIP-like vision-language models as general-purpose density-ratio estimators and demonstrates two practical avenues: (1) simple, prompt-driven importance weighting to achieve domain-adaptive pretraining gains, and (2) KL-divergence-based analysis to quantify semantic diversity and guide data curation. By formalizing how InfoNCE/NCE logits reflect conditional-to-marginal density ratios, the authors enable direct use of these models for distributional analysis and sample weighting without further training. Empirical results on multimodal data show that density-ratio-based methods improve downstream metrics, reveal semantic diversity signals via KL measures, and offer data-curation strategies competitive with large-scale filtering baselines like LAION2B. The extension to SigLIP demonstrates the generality of the density-ratio viewpoint across CLIP-like architectures, suggesting broad applicability in multimodal learning and data management scenarios.
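
As a rough illustration of prompt-driven importance weighting, the sketch below converts the cosine similarity between each image embedding and a single prompt embedding into self-normalized sample weights. The function names, the toy 2-D embeddings, the temperature value, and the mean-one normalization are all assumptions made for illustration; this is not the authors' Importance Weight Learning implementation, only one plausible reading of "treat the scaled similarity as a log density ratio up to a constant".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prompt_importance_weights(image_embs, prompt_emb, temperature=0.01):
    """Weights proportional to exp(cos(image, prompt) / temperature).

    Assumption: the scaled similarity stands in for a log density ratio
    up to an additive constant, so the weights are only defined up to
    scale. Here they are rescaled to have mean 1 over the dataset.
    """
    logits = [cosine(e, prompt_emb) / temperature for e in image_embs]
    shift = max(logits)  # subtract the max for numerical stability
    unnormalized = [math.exp(l - shift) for l in logits]
    total = sum(unnormalized)
    n = len(unnormalized)
    return [n * u / total for u in unnormalized]


# Hypothetical toy embeddings; real ones would come from a CLIP-like encoder.
image_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
prompt_emb = [1.0, 0.0]
weights = prompt_importance_weights(image_embs, prompt_emb, temperature=0.5)
```

Images closer to the prompt receive larger weights, which could then scale per-sample losses during training. The temperature controls how sharply the weights concentrate on the best-matching samples.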

Abstract

Density ratio estimation is a core concept in statistical machine learning because it provides a unified mechanism for tasks such as importance weighting, divergence estimation, and likelihood-free inference, but its potential in vision and language models has not been fully explored. Modern vision-language encoders such as CLIP and SigLIP are trained with contrastive objectives under which the learned similarity scores are implicitly proportional to log density ratios between the joint and marginal image-text distributions. However, prior work has largely focused on their embedding utility, and the density-ratio structure induced by contrastive learning has not been systematically examined or exploited in multimodal applications. To address this gap, we reinterpret CLIP-style models as pretrained and general-purpose density ratio estimators and show that this perspective enables new algorithmic capabilities. We present a unified explanation of how contrastive objectives estimate density ratios and propose two practical applications: Importance Weight Learning and KL divergence estimation. Our Importance Weight Learning method requires only a single additional prompt and improves F1 scores by up to 7 points. We further show that CLIP-based density ratios support estimation of KL divergences that quantify how conditioning on an image or text alters the distribution of the other modality. Through qualitative examples and an N-gram analysis of captions, we find that these divergences capture semantic diversity and mode structure in multimodal data. Leveraging this property, we introduce a simple KL-guided data curation method that achieves performance competitive with LAION2B filtering.
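
To make the KL divergence application concrete, here is a minimal sketch of one way such a divergence could be estimated from similarity logits. It assumes that, for a fixed image i, the logits against N captions drawn from the marginal p(t) equal log p(t|i)/p(t) up to an additive constant. Under that assumption the softmax weights w_j approximate p(t_j|i) / (N p(t_j)), and D_KL(p(t|i) || p(t)) can be approximated by sum_j w_j log(N w_j). The function name and the toy logits are hypothetical, and the paper's actual estimator may differ.

```python
import math


def kl_from_logits(logits):
    """Self-normalized estimate of D_KL(p(t|i) || p(t)) for one image i.

    `logits` are scaled similarities between image i and N captions
    assumed to be sampled from the marginal p(t). The result equals
    log(N) minus the entropy of the softmax weights, so it lies in
    the interval [0, log N].
    """
    n = len(logits)
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Terms with zero weight contribute nothing (0 * log 0 := 0).
    return sum(w * math.log(n * w) for w in weights if w > 0.0)


# An image that matches every caption equally well: conditioning on it
# does not change the caption distribution, so the estimate is 0.
kl_uniform = kl_from_logits([0.0, 0.0, 0.0, 0.0])

# An image that matches a single caption far better than the rest:
# the estimate approaches its upper bound log N.
kl_peaked = kl_from_logits([100.0, 0.0, 0.0, 0.0])
```

Swapping the roles of images and captions gives the analogous quantity for a fixed caption. Because the estimate is capped at log N, values are only comparable across runs that use the same number of reference samples.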

Paper Structure

This paper contains 20 sections, 28 equations, 21 figures, 1 table.

Figures (21)

  • Figure 1: Top and bottom image examples in MSCOCO captions ranked by KL divergence (Eq. \ref{eq:dkl}), computed from CLIP cosine similarity. The bottom samples share similar contexts, such as sports, whereas the top samples have diverse contexts, including non-English content.
  • Figure 2: Zero-shot classification performance comparison between baseline and our IWL method across three downstream datasets.
  • Figure 3: Top and bottom captions ranked by $D_\mathrm{KL}$.
  • Figure 4: N-gram probability coverage across $D_\mathrm{KL}$ deciles. Each decile in \ref{fig:mscoco_img} is a group of captions corresponding to images $i$ that have the same level of $D_\mathrm{KL}(i)$. Each decile in \ref{fig:mscoco_txt} is a group of captions $t$ that have the same level of $D_\mathrm{KL}(t)$.
  • Figure 5: Top and bottom image examples in MSCOCO captions ranked by $D_\mathrm{KLR}$ through cosine similarity of CLIP.
  • ...and 16 more figures