Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications

Paul Pu Liang; Chun Kai Ling; Yun Cheng; Alex Obolenskiy; Yudong Liu; Rohan Pandey; Alex Wilf; Louis-Philippe Morency; Ruslan Salakhutdinov

TL;DR

This work tackles the problem of quantifying multimodal interactions when only labeled unimodal data are available alongside unlabeled multimodal data. It adopts Partial Information Decomposition to define redundancy, uniqueness, and synergy, and derives computable lower bounds on synergy from redundancy and from unimodal classifier disagreement, plus an upper bound via min-entropy couplings. The authors validate these bounds on synthetic and large real-world datasets, showing that they track true interactions and can predict multimodal model performance, guiding data collection and model selection. They also provide practical guidelines and discuss computational aspects, including discretization of continuous modalities and the NP-hardness of exact min-entropy couplings, offering tractable approximations. Overall, the results establish a data-driven, information-theoretic framework to plan multimodal fusion strategies under labeling constraints and to anticipate when complex fusion will yield gains.
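To make the three interaction types concrete, the sketch below uses the standard Partial Information Decomposition consistency identities $I(X_1;Y) = R + U_1$, $I(X_2;Y) = R + U_2$, and $I(X_1,X_2;Y) = R + U_1 + U_2 + S$ on a toy XOR task. The XOR dataset and the helper name `mutual_information` are illustrative assumptions, not taken from this page; the example only shows why a task can be purely synergistic.

```python
import itertools
import math
from collections import defaultdict

def mutual_information(joint, a_idx, b_idx):
    """I(A;B) in bits. `joint` maps outcome tuples to probabilities;
    a_idx and b_idx pick which coordinates of each outcome form A and B."""
    pa, pb, pab = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, prob in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        pa[a] += prob
        pb[b] += prob
        pab[(a, b)] += prob
    return sum(prob * math.log2(prob / (pa[a] * pb[b]))
               for (a, b), prob in pab.items() if prob > 0)

# Toy task (assumption): outcomes are (x1, x2, y) with y = x1 XOR x2, uniform inputs.
xor_joint = {(x1, x2, x1 ^ x2): 0.25
             for x1, x2 in itertools.product([0, 1], repeat=2)}

i1 = mutual_information(xor_joint, (0,), (2,))     # I(X1;Y)
i2 = mutual_information(xor_joint, (1,), (2,))     # I(X2;Y)
i12 = mutual_information(xor_joint, (0, 1), (2,))  # I(X1,X2;Y)

# Since R, U1, U2 >= 0 and R + U1 = I(X1;Y) = 0, R + U2 = I(X2;Y) = 0,
# we get R = U1 = U2 = 0, so all of I(X1,X2;Y) is synergy.
synergy = i12 - i1 - i2
print(i1, i2, i12, synergy)
```

Here neither modality alone carries any information about the label, yet together they determine it, which is the situation the synergy bounds are meant to detect without labeled multimodal data.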

Abstract

In many machine learning systems that jointly learn from multiple modalities, a core research question is to understand the nature of multimodal interactions: how modalities combine to provide new task-relevant information that was not present in either alone. We study this challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data and naturally co-occurring multimodal data (e.g., unlabeled images and captions, video and corresponding audio) for which labeling is time-consuming. Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds to quantify the amount of multimodal interactions in this semi-supervised setting. We propose two lower bounds: one based on the shared information between modalities and the other based on disagreement between separately trained unimodal classifiers, and derive an upper bound through connections to approximate algorithms for min-entropy couplings. We validate these estimated bounds and show how they accurately track true interactions. Finally, we show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
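As a rough illustration of what an approximate min-entropy coupling looks like, the sketch below implements a generic greedy heuristic from the coupling literature: repeatedly match the largest remaining mass of each marginal. It is an assumption-laden toy, not necessarily the procedure used by the authors, and the function names `greedy_coupling` and `entropy_bits` are hypothetical.

```python
import math

def greedy_coupling(p, q, tol=1e-12):
    """Greedy heuristic for a low-entropy joint distribution with marginals p and q.
    Each step zeroes out at least one marginal entry, so the loop terminates."""
    p, q = list(p), list(q)
    joint = [[0.0] * len(q) for _ in p]
    while max(p) > tol and max(q) > tol:
        i = max(range(len(p)), key=p.__getitem__)  # largest remaining mass in p
        j = max(range(len(q)), key=q.__getitem__)  # largest remaining mass in q
        m = min(p[i], q[j])
        joint[i][j] += m
        p[i] -= m
        q[j] -= m
    return joint

def entropy_bits(joint):
    """Shannon entropy (in bits) of a joint distribution given as a nested list."""
    return -sum(v * math.log2(v) for row in joint for v in row if v > 0)

# Example marginals (assumption, chosen only for illustration).
p = [0.5, 0.25, 0.25]
q = [0.5, 0.5]
coupled = greedy_coupling(p, q)
independent = [[a * b for b in q] for a in p]
print(entropy_bits(coupled), entropy_bits(independent))
```

The greedy coupling keeps both marginals fixed while concentrating mass on few cells, so its joint entropy is lower than that of the independent coupling with the same marginals.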

Paper Structure (63 sections, 9 theorems, 44 equations, 11 figures, 7 tables)

Introduction
Related Work and Technical Background
Semi-supervised multimodal learning
Multimodal interactions and information theory
Estimating Semi-supervised Multimodal Interactions
Understanding relationships between interactions
Synergy and redundancy
Synergy and uniqueness
Lower and upper bounds on synergy
Lower bound using redundancy
Lower bound using uniqueness
Upper bound on synergy
Experiments
Verifying interaction estimation in semi-supervised learning
Synthetic bitwise datasets
...and 48 more sections

Key Result

Theorem 1

(Lower-bound on synergy via redundancy) We relate $S$ to modality dependence

Figures (11)

Figure 1: We study the relationships between (left) synergy and redundancy as a result of the task $Y$ either increasing or decreasing the shared information between $X_1$ and $X_2$ (i.e., common cause structures as opposed to redundancy in common effect), as well as (right) synergy and uniqueness due to the disagreement between unimodal predictors resulting in a new prediction $y \neq y_1 \neq y_2$ (rather than uniqueness where $y = y_2 \neq y_1$).
Figure 2: Our two lower bounds $\underline{S}_\textrm{R}$ and $\underline{S}_\textrm{U}$ track actual synergy $S$ from below, and the upper bound $\overline{S}$ tracks $S$ from above. We find that $\underline{S}_\textrm{R},\underline{S}_\textrm{U}$ tend to approximate $S$ better than $\overline{S}$.
Figure 3: Datasets with higher estimated multimodal performance $\hat{P}_M$ tend to show improvements from unimodal to multimodal (left) and from simple to complex multimodal fusion (right).
Figure 4: Examples of valid and invalid colorings. Left vertices are teachers 1, 2, 3. Right vertices are classes 1, 2. The colors red, green, and blue denote hours 1, 2, and 3 respectively. The colors of a teacher vertex are the hours when that teacher is available (by definition of RTT, the number of distinct colors per teacher vertex is equal to its degree). The color of an edge (red, green, or blue) indicates that a teacher is assigned to that class at that hour. The valid-coloring panel shows a valid coloring (or timetabling), since (i) all edges are colored, (ii) no two edges of the same color are adjacent, and (iii) edges adjacent to a teacher match that vertex's colors. The remaining three panels (repeated class, repeated teacher, invalid color) show invalid colorings, because same-colored edges are adjacent or because an edge's color differs from the colors of its teacher vertex.
Figure 5: Examples of valid and invalid colorings when holding rooms are included. For simplicity, we illustrate all constraints except those on $Q$. Left vertices are teachers 1, 2, 3 and holding rooms $Z_1, Z_2, Z_3$. Right vertices are classes 1, 2. The colors red, green, and blue denote hours 1, 2, and 3 respectively. The colors of a teacher vertex are the hours when that teacher is available (by definition of RTT, the number of distinct colors per teacher vertex is equal to its degree). The border color of a holding room vertex is the hour at which that holding room is available. The color of an edge (red, green, or blue) indicates that a teacher (or holding room) is assigned to that class at that hour. Gray edges carry the "null" color, meaning that the holding room is not used by that class. The valid-coloring panel shows a valid coloring (or timetabling), since all edges are colored, no two edges of the same color are adjacent (other than the gray ones), and edges adjacent to a teacher match that vertex's colors. The remaining three panels (repeated class, repeated teacher, invalid color) show invalid colorings, because non-gray edges of the same color are adjacent or because a teacher vertex is adjacent to an edge whose color differs from its own.
...and 6 more figures

Theorems & Definitions (18)

Definition 1
Theorem 1
Definition 2
Theorem 2
Theorem 3
Theorem 4
Theorem 5
Theorem 6
proof
Definition 3
...and 8 more
