Multiplicity is an Inevitable and Inherent Challenge in Multimodal Learning

Sanghyuk Chun

TL;DR

This paper argues that multiplicity, the inherent many-to-many relationship between modalities, is an inevitable bottleneck in multimodal learning that arises from intra-modal variability, asymmetry, and task-dependent alignment. It surveys how multiplicity permeates data construction, training (notably contrastive and retrieval-based methods), and evaluation, highlighting issues such as input and matching ambiguity, false negatives, and unreliable benchmarks. The author discusses current attempts to address it (noisy-label approaches, multiple embeddings, probabilistic modeling, and mixture-of-experts) and emphasizes that a unified, multiplicity-aware framework is still needed. The paper advocates task-driven data collection, multiplicity-aware evaluation metrics, and new modeling paradigms (e.g., stochastic embeddings, conditional and compositional approaches) to handle real-world multimodal data robustly and to build more reliable, scalable systems.

Abstract

Multimodal learning has seen remarkable progress, particularly with the emergence of large-scale pre-training across various modalities. However, most current approaches are built on the assumption of a deterministic, one-to-one alignment between modalities. This assumption oversimplifies real-world multimodal relationships, which are inherently many-to-many. This phenomenon, named multiplicity, is not a side effect of noise or annotation error, but an inevitable outcome of semantic abstraction, representational asymmetry, and task-dependent ambiguity in multimodal tasks. This position paper argues that multiplicity is a fundamental bottleneck that manifests across all stages of the multimodal learning pipeline, from data construction to training and evaluation. It examines the causes and consequences of multiplicity, and highlights how multiplicity introduces training uncertainty, unreliable evaluation, and low dataset quality. The paper calls for new research directions in multimodal learning: novel multiplicity-aware learning frameworks and dataset construction protocols that take multiplicity into account.

Paper Structure

This paper contains 22 sections, 4 figures.

Figures (4)

  • Figure 1: How do unimodal and multimodal tasks differ? Unimodal tasks assume a fixed, pre-defined label set. Even as more instances are added to the dataset, the number of correspondences grows by a constant amount per instance, and a new instance does not affect the existing instances. In multimodal datasets, however, assuming one-to-one mapping, adding one multimodal pair increases the number of correspondences by $O(N)$.
  • Figure 2: How does multiplicity occur? The sources of multiplicity in multimodal datasets are diverse.
  • Figure 3: Multiplicity induces ambiguity. (a) In an ideal dataset consisting of full pairwise annotations, an input should correspond to multiple instances from the other modality. The current one-to-one paradigm cannot handle this. (b) In practice, pairwise annotations are sparse: each input corresponds to only one instance. In this case, multiplicity introduces a new kind of uncertainty, named matching ambiguity.
  • Figure 4: Human preference vs. evaluation metrics under multiplicity. Chun et al. (2022) asked human annotators to compare four retrieval scenarios: (A) only top-1 is wrong, (B) only top-1 is correct, (C) top-1 to top-5 are wrong, and (D) only top-5 is correct. From these pairwise comparisons, the human preference (HP) score is computed with the Bradley-Terry (BT) model (Bradley and Terry, 1952). mAP@R (Musgrave et al., 2020) is highly correlated with HP, while R@K scores often are not.
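
To make the contrast in Figure 4 concrete, the following is a minimal sketch of how R@K and mAP@R can score the four scenarios differently. The ranked lists, the list length of 8, and the assumption that each query has R = 8 relevant gallery items are hypothetical choices for illustration; they are not the setup used in the cited study. mAP@R follows the usual definition: the mean over the top-R ranks of precision-at-i, counted only at ranks where the retrieved item is relevant.

```python
from typing import Dict, List, Sequence


def recall_at_k(relevance: Sequence[int], k: int) -> float:
    """Retrieval-style R@K: 1.0 if any of the top-k items is relevant, else 0.0."""
    return float(any(relevance[:k]))


def map_at_r(relevance: Sequence[int], num_relevant: int) -> float:
    """mAP@R: mean over the top-R ranks of precision-at-i, counted only at hits."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance[:num_relevant], start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / num_relevant


# Hypothetical binary relevance lists (1 = relevant), ordered by rank.
SCENARIOS: Dict[str, List[int]] = {
    "A: only top-1 is wrong":     [0, 1, 1, 1, 1, 1, 1, 1],
    "B: only top-1 is correct":   [1, 0, 0, 0, 0, 0, 0, 0],
    "C: top-1 to top-5 are wrong": [0, 0, 0, 0, 0, 1, 1, 1],
    "D: only top-5 is correct":   [0, 0, 0, 0, 1, 0, 0, 0],
}
NUM_RELEVANT = 8  # assumed number of relevant gallery items per query

for name, ranked in SCENARIOS.items():
    print(
        f"{name:<28} "
        f"R@1={recall_at_k(ranked, 1):.0f} "
        f"R@5={recall_at_k(ranked, 5):.0f} "
        f"mAP@R={map_at_r(ranked, NUM_RELEVANT):.3f}"
    )
```

Under these assumed lists, R@1 prefers scenario B to scenario A even though A retrieves far more relevant items, and R@5 cannot separate A, B, and D at all, whereas mAP@R assigns each scenario a distinct score. This is one way to see why a single-hit metric can be a weak proxy when a query has many valid matches.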