Weighted Point Set Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric

Toshimitsu Uesaka, Taiji Suzuki, Yuhta Takida, Chieh-Hsin Lai, Naoki Murata, Yuki Mitsufuji

TL;DR

This paper addresses the limitation of single-vector representations in multimodal contrastive learning by introducing Weighted Point Set Embedding (WPSE), where each concept is represented as a weighted set of vectors and similarities are computed via kernelized pairwise interactions. It establishes a theoretical connection that the optimal symmetric InfoNCE similarity corresponds to the pointwise mutual information (PMI) and shows that, under this optimal similarity, a linear classifier on top of the learned WPSE representations can achieve near-optimal downstream performance. The authors prove a finite-sample excess-risk bound that decomposes into KL-divergence terms, analyze the impact of deviations from PMI, and demonstrate that a nonlinear kernel over weighted point sets can approximate PMI arbitrarily well with a mild generation assumption. Empirically, WPSE pretrained on large text–image datasets with random Fourier feature–based kernel approximations yields improved zero-shot transfer and linear classification results over CLIP baselines on multiple benchmarks, validating both the theory and practical utility of the approach.
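To make the idea of kernelized pairwise interactions concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Gaussian kernel and a similarity of the form $\sum_{i,j} w_i w'_j\, k(v_i, v'_j)$. The paper's exact kernel, weight normalization, and any additional terms are not given in this summary. The names `point_set_similarity`, `make_rff`, and `embed_point_set` are hypothetical. The second half shows how random Fourier features could turn that double sum into a single dot product between two pooled feature vectors.

```python
import math
import random

def gaussian_kernel(u, v, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def point_set_similarity(set_x, set_y, bandwidth=1.0):
    """Weighted sum of kernel values over all cross pairs of points.

    Each set is a list of (weight, vector) pairs.
    """
    return sum(
        w_x * w_y * gaussian_kernel(v_x, v_y, bandwidth)
        for w_x, v_x in set_x
        for w_y, v_y in set_y
    )

def make_rff(dim, num_features, bandwidth=1.0, seed=0):
    """Build a random Fourier feature map z with z(u) . z(v) ~ gaussian_kernel(u, v)."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 1/bandwidth^2), phases from U[0, 2*pi].
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(v):
        return [scale * math.cos(sum(f * a for f, a in zip(freq, v)) + phase)
                for freq, phase in zip(freqs, phases)]

    return features

def embed_point_set(point_set, features):
    """Collapse a weighted point set into one vector: sum_i w_i * z(v_i)."""
    total = [0.0] * len(features(point_set[0][1]))
    for weight, vector in point_set:
        total = [t + weight * z for t, z in zip(total, features(vector))]
    return total

# Toy weighted point sets standing in for one image and one caption.
image_repr = [(0.7, [1.0, 0.0]), (0.3, [0.0, 1.0])]
text_repr = [(0.5, [1.0, 0.0]), (0.5, [0.0, 1.0])]

exact = point_set_similarity(image_repr, text_repr)

features = make_rff(dim=2, num_features=4000)
approx = sum(a * b for a, b in zip(embed_point_set(image_repr, features),
                                   embed_point_set(text_repr, features)))
```

With the feature map fixed, each input is embedded once and every pairwise similarity in a batch reduces to a dot product, which is the usual motivation for a random-feature approximation in this setting.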

Abstract

In typical multimodal contrastive learning, such as CLIP, encoders produce one point in the latent representation space for each input. However, a one-point representation has difficulty capturing the relationships and similarity structure among the huge number of instances in the real world. To obtain richer classes of similarity, we propose the use of weighted point sets, namely, sets of (weight, vector) pairs, as representations of instances. In this work, we theoretically show the benefit of our proposed method through a new understanding of the contrastive loss of CLIP, which we call symmetric InfoNCE. We clarify that the optimal similarity that minimizes symmetric InfoNCE is the pointwise mutual information, and show an upper bound on the excess risk on downstream classification tasks for representations that achieve the optimal similarity. In addition, we show that our proposed similarity based on weighted point sets consistently achieves the optimal similarity. To verify the effectiveness of our proposed method, we demonstrate pretraining of text-image representation models and classification tasks on common benchmarks.
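As an illustration of the loss the abstract refers to, here is a minimal sketch of a CLIP-style symmetric contrastive loss computed from a batch similarity matrix. It assumes matched pairs sit on the diagonal and that the two directions (image-to-text and text-to-image) are averaged with equal weight. The paper's exact normalization and any temperature scaling are not stated in this summary, and the function names are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def symmetric_info_nce(sim):
    """Average of row-wise and column-wise softmax cross-entropy.

    sim[i][j] is the similarity between image i and text j;
    the matched pair for index i is assumed to be sim[i][i].
    """
    n = len(sim)
    loss_rows = 0.0  # image -> text direction
    loss_cols = 0.0  # text -> image direction
    for i in range(n):
        row = sim[i]
        col = [sim[k][i] for k in range(n)]
        loss_rows += log_sum_exp(row) - sim[i][i]
        loss_cols += log_sum_exp(col) - sim[i][i]
    return 0.5 * (loss_rows + loss_cols) / n

# An uninformative similarity gives ln(n); a sharper diagonal lowers the loss.
uniform = [[0.0] * 4 for _ in range(4)]
peaked = [[5.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
loss_uniform = symmetric_info_nce(uniform)
loss_peaked = symmetric_info_nce(peaked)
```

Any similarity function can fill the matrix `sim`, including a kernel-based similarity over weighted point sets. The loss only sees the resulting scores.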

Paper Structure

This paper contains 37 sections, 8 theorems, 28 equations, 2 figures, 6 tables, 1 algorithm.

Key Result

Proposition 4.1

Let $X$ and $Y$ denote two random variables having the joint probability density $p$. Then, the mutual information of $X$ and $Y$, $I(X,Y) := \mathop{\mathbb{E}}_{p(x,y)}\left[\ln \frac{p(x,y)}{p(x)p(y)}\right]$ is an upper bound of $-\mathcal{L}_\mathrm{NCE}(g)$. Moreover, if the function $g$ satis

Figures (2)

  • Figure 1: Overview of proposed method. Each encoder produces a weighted point set from each input. The encoders are optimized with the symmetric InfoNCE using the similarity matrix.
  • Figure 2: Proposed modification for encoders to produce a weighted point set. Encoders are modeled by Transformer. The encoders output all resultant vectors instead of just one vector at a certain position.

Theorems & Definitions (14)

  • Proposition 4.1: Restatement of Proposition 1 in zhang2023deep
  • Theorem 4.2
  • Lemma 4.3
  • Theorem 4.4
  • Theorem 5.1
  • proof
  • proof
  • Proposition C.1
  • proof
  • Proposition C.3: aronszajn1950theory, sriperumbudur2011universality
  • ...and 4 more