Weighted Point Set Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric

Toshimitsu Uesaka; Taiji Suzuki; Yuhta Takida; Chieh-Hsin Lai; Naoki Murata; Yuki Mitsufuji

TL;DR

This paper addresses a limitation of single-vector representations in multimodal contrastive learning by introducing Weighted Point Set Embedding (WPSE), in which each input is represented as a weighted set of vectors and similarities are computed from kernelized pairwise interactions between the points of two sets. It shows that the similarity minimizing the symmetric InfoNCE loss is the pointwise mutual information (PMI) and that, when this optimal similarity is attained, a linear classifier on top of the learned representations can achieve near-optimal downstream performance. The authors prove a finite-sample excess-risk bound that decomposes into KL-divergence terms, analyze how deviations of the learned similarity from PMI affect that bound, and show that a nonlinear kernel over weighted point sets can approximate PMI arbitrarily well under a mild assumption on data generation. Empirically, WPSE models pretrained on large text-image datasets, with the kernel approximated by random Fourier features, improve zero-shot transfer and linear classification over CLIP baselines on multiple benchmarks, which is consistent with the theoretical analysis.
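The kernelized pairwise similarity and its random Fourier feature approximation can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration rather than the authors' implementation: the Gaussian kernel, the bandwidth `sigma`, the function names, the unconstrained sign of the weights, and the randomly drawn point sets standing in for encoder outputs.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Exact Gaussian kernel matrix between rows of a (m, d) and b (n, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def wps_similarity(w_x, v_x, w_y, v_y, sigma=1.0):
    """Similarity of two weighted point sets: sum_ij w_x[i] * w_y[j] * k(v_x[i], v_y[j])."""
    return float(w_x @ gaussian_kernel(v_x, v_y, sigma) @ w_y)

def rff_features(v, omega, bias):
    """Random Fourier features z(v) whose inner products approximate the Gaussian kernel."""
    n_features = omega.shape[1]
    return np.sqrt(2.0 / n_features) * np.cos(v @ omega + bias)

def wps_embedding_rff(w, v, omega, bias):
    """Collapse a weighted point set into a single vector: sum_i w[i] * z(v[i])."""
    return w @ rff_features(v, omega, bias)

rng = np.random.default_rng(0)
d, n_features, sigma = 4, 20000, 1.0

# Frequencies for a Gaussian kernel with bandwidth sigma, plus uniform phases.
omega = rng.normal(scale=1.0 / sigma, size=(d, n_features))
bias = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

# Hypothetical weighted point sets for one input of each modality.
w_x, v_x = rng.uniform(-1.0, 1.0, size=3), rng.normal(size=(3, d))
w_y, v_y = rng.uniform(-1.0, 1.0, size=5), rng.normal(size=(5, d))

exact = wps_similarity(w_x, v_x, w_y, v_y, sigma)
approx = float(
    wps_embedding_rff(w_x, v_x, omega, bias) @ wps_embedding_rff(w_y, v_y, omega, bias)
)
print(f"exact kernel similarity: {exact:.4f}, RFF approximation: {approx:.4f}")
```

With the feature map in place, each weighted point set collapses to one vector, so a batch similarity matrix reduces to an ordinary matrix product even though the underlying similarity is a nonlinear kernel over sets.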

Abstract

In typical multimodal contrastive learning, such as CLIP, encoders produce one point in the latent representation space for each input. However, a one-point representation has difficulty capturing the relationships and similarity structure among the huge number of instances in the real world. To obtain richer classes of similarity, we propose the use of weighted point sets, namely, sets of pairs of weight and vector, as representations of instances. In this work, we theoretically show the benefit of our proposed method through a new understanding of the contrastive loss of CLIP, which we call symmetric InfoNCE. We clarify that the optimal similarity that minimizes symmetric InfoNCE is the pointwise mutual information, and show an upper bound on the excess risk in downstream classification tasks for representations that achieve the optimal similarity. In addition, we show that our proposed similarity based on weighted point sets consistently achieves the optimal similarity. To verify the effectiveness of our proposed method, we demonstrate pretraining of text-image representation models and classification tasks on common benchmarks.
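As a rough illustration of the loss discussed in the abstract, the sketch below computes a CLIP-style symmetric contrastive loss from a batch similarity matrix. It makes several assumptions that the text above does not confirm: matched pairs sit on the diagonal, the two directions are averaged rather than summed, and no temperature is applied. The function names are illustrative.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(sim):
    """Symmetric contrastive loss for a square similarity matrix.

    sim[i, j] is the similarity between image i and text j; matched pairs
    are assumed to lie on the diagonal.
    """
    image_to_text = -np.diag(log_softmax(sim, axis=1)).mean()  # each row picks its text
    text_to_image = -np.diag(log_softmax(sim, axis=0)).mean()  # each column picks its image
    return 0.5 * (image_to_text + text_to_image)

n = 8
uninformative = np.zeros((n, n))   # every pair looks equally similar
well_aligned = 20.0 * np.eye(n)    # matched pairs dominate

loss_uninformative = symmetric_info_nce(uninformative)
loss_aligned = symmetric_info_nce(well_aligned)
print(loss_uninformative, loss_aligned)
```

An uninformative similarity gives a loss of `log(n)` under this convention, while a similarity that strongly separates matched from mismatched pairs drives the loss toward zero. Any similarity function, including one built from weighted point sets, can supply the matrix `sim`.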

Paper Structure (37 sections, 8 theorems, 28 equations, 2 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Multimodal contrastive representation learning in practice
Theoretical understanding of contrastive loss
Problem setup
Contrastive representation learning and symmetric InfoNCE
Downstream classification task
Theoretical guarantee via pointwise mutual information
Pointwise mutual information as optimal similarity
Pointwise mutual information estimator leads to a good linear classifier
Remark.
Excess risk analysis via the gap from the pointwise mutual information
Remark.
Augmented similarity by weighted point sets
Limitation of the inner-product similarity in finite dimensional spaces
...and 22 more sections

Key Result

Proposition 4.1

Let $X$ and $Y$ denote two random variables having the joint probability density $p$. Then, the mutual information of $X$ and $Y$, $I(X,Y) := \mathop{\mathbb{E}}_{p(x,y)}\left[\ln \frac{p(x,y)}{p(x)p(y)}\right]$ is an upper bound of $-\mathcal{L}_\mathrm{NCE}(g)$. Moreover, if the function $g$ satis
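For reference, the pointwise mutual information named in the TL;DR and abstract is the quantity inside the expectation that defines $I(X,Y)$. Written in the proposition's notation (the symbol $\mathrm{PMI}$ is introduced here only for illustration), the definition and the bound stated in the proposition read:

$$
\mathrm{PMI}(x, y) := \ln \frac{p(x,y)}{p(x)\,p(y)},
\qquad
I(X,Y) = \mathop{\mathbb{E}}_{p(x,y)}\left[\mathrm{PMI}(x, y)\right],
\qquad
-\mathcal{L}_\mathrm{NCE}(g) \le I(X,Y).
$$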

Figures (2)

Figure 1: Overview of proposed method. Each encoder produces a weighted point set from each input. The encoders are optimized with the symmetric InfoNCE using the similarity matrix.
Figure 2: Proposed modification for encoders to produce a weighted point set. Encoders are modeled by Transformer. The encoders output all resultant vectors instead of just one vector at a certain position.

Theorems & Definitions (14)

Proposition 4.1: Restatement of Proposition 1 in zhang2023deep
Theorem 4.2
Lemma 4.3
Theorem 4.4
Theorem 5.1
proof
proof
Proposition C.1
proof
Proposition C.3: aronszajn1950theory, sriperumbudur2011universality
...and 4 more
