Foundations of Multisensory Artificial Intelligence

Paul Pu Liang

TL;DR

This work establishes a principled foundation for multisensory AI by formalizing how modalities interact via redundancy, uniqueness, and synergy using Partial Information Decomposition (PID). It develops scalable estimators for PID (CVX and Batch), introduces FactorCL to factorize shared and unique information, and proposes MultiViz for model understanding. The thesis also contributes a standardized benchmark (MultiBench), practical and scalable architectures (MulT, HighMMT), and visualization tools that generalize across many modalities and tasks, with real-world applications in healthcare, affective computing, and robotics. By standardizing benchmarks and proposing data-driven measures of modalities and their interactions, it provides actionable guidance for dataset collection, model selection, and robust deployment of multisensory AI systems.
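To make the redundancy and synergy terminology concrete, the following is a minimal sketch on two toy distributions. It computes only ordinary mutual information, not a full PID, and it is not the thesis's CVX or Batch estimator. The helper name `mutual_information` and the two toy distributions are assumptions introduced purely for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def mutual_information(joint, a_idx, b_idx):
    """Return I(A;B) in bits.

    `joint` maps outcome tuples to probabilities; `a_idx` and `b_idx`
    select which coordinates of each outcome form A and B.
    (Hypothetical helper, written for this illustration only.)
    """
    pa, pb, pab = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        pa[a] += p
        pb[b] += p
        pab[(a, b)] += p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in pab.items()
        if p > 0
    )


# Outcomes are (x1, x2, y).
# Toy synergy case: y = x1 XOR x2 with independent uniform bits.
xor_joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
# Toy redundancy case: both modalities are copies of the label.
copy_joint = {(x, x, x): 0.5 for x in (0, 1)}

for name, joint in (("xor", xor_joint), ("copy", copy_joint)):
    i1 = mutual_information(joint, (0,), (2,))       # I(X1; Y)
    i2 = mutual_information(joint, (1,), (2,))       # I(X2; Y)
    i12 = mutual_information(joint, (0, 1), (2,))    # I(X1, X2; Y)
    print(f"{name}: I(X1;Y)={i1:.2f}  I(X2;Y)={i2:.2f}  I(X1,X2;Y)={i12:.2f}")
```

In the XOR case neither modality alone carries information about the label, yet together they determine it, which is the pattern that PID attributes to synergy. In the copy case each modality alone already carries the full bit and combining them adds nothing, which corresponds to redundancy.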

Abstract

Building multisensory AI systems that learn from multiple sensory inputs such as text, speech, video, real-world sensors, wearable devices, and medical data holds great promise for impact in many scientific areas with practical benefits, such as in supporting human health and well-being, enabling multimedia content processing, and enhancing real-world autonomous agents. By synthesizing a range of theoretical frameworks and application domains, this thesis aims to advance the machine learning foundations of multisensory AI. In the first part, we present a theoretical framework formalizing how modalities interact with each other to give rise to new information for a task. These interactions are the basic building blocks in all multimodal problems, and their quantification enables users to understand their multimodal datasets, design principled approaches to learn these interactions, and analyze whether their model has succeeded in learning. In the second part, we study the design of practical multimodal foundation models that generalize over many modalities and tasks, which presents a step toward grounding large language models to real-world sensory modalities. We introduce MultiBench, a unified large-scale benchmark across a wide range of modalities, tasks, and research areas, followed by the cross-modal attention and multimodal transformer architectures that now underpin many of today's multimodal foundation models. Scaling these architectures on MultiBench enables the creation of general-purpose multisensory AI systems, and we discuss our collaborative efforts in applying these models for real-world impact in affective computing, mental health, cancer prognosis, and robotics. Finally, we conclude this thesis by discussing how future work can leverage these ideas toward more general, interactive, and safe multisensory AI.

Paper Structure (179 sections, 8 theorems, 44 equations, 63 figures, 32 tables, 3 algorithms)

Introduction
Foundations of Multimodal Interactions
Multisensory Foundation Models
Summary of Contributions
Other Contributions
Multimodal representation learning
Applications in affective computing, social intelligence, and healthcare
Real-world robustness, fairness, and privacy
Literature Survey and Taxonomy of Multimodal Challenges
Introduction
Key modalities and application domains
Foundational Principles in Multimodal Research
Principle 1: Modalities are heterogeneous
Principle 2: Modalities are connected
Principle 3: Modalities interact
...and 164 more sections

Key Result

Theorem 1

(Suboptimality of standard CL) When there is multi-view non-redundancy as in Definition (eq:multiview_nonredundancy_assump), given optimal representations $\{Z_1, Z_2\}$ that satisfy Eq. (eq:standard_cl_z) and $I(Z_1; Y \mid X_2) = I(Z_2; Y \mid X_1) = 0$ [tian2020makes], we have that … Correspondingly, the Bayes error rate $P_e(Z_1, Z_2) := 1 - \mathbb{E}_{p(z_1, z_2)}\left[\max_{y \in Y} P\left(\hat{Y} = y \mid z_1, z_2\right)\right]$ …

Figures (63)

Figure 1.0.1: This thesis is designed to advance the theoretical and computational foundations of multimodal machine learning, and enable the creation of next-generation multimodal technologies. It starts by identifying the common themes and open questions in the field, through a taxonomy of six core challenges in multimodal research: representation, alignment, reasoning, generation, transference, and quantification. The bulk of the thesis studies two core challenges in multimodal learning: (1) building a foundation for multimodal interactions that enables the quantification of multimodal interactions in data and their principled modeling using machine learning methods, and (2) the data requirements and model building blocks enabling generalization of knowledge between modalities, tasks, and their representations.
Figure 1.4.1: I have also pursued the following directions during my Ph.D. studies: (1) new machine learning and deep learning models to learn multimodal representations (without modeling generalization), (2) collaborating with real-world stakeholders to apply these methods in affective computing, socially intelligent AI, healthcare, and education, and (3) mitigating real-world issues of deploying multimodal models in the face of real-world noise topologies, dataset biases, and privacy concerns.
Figure 2.1.1: Core research challenges in multimodal learning: Every multimodal problem typically requires tackling representation and alignment: (1) Representation studies how to summarize multimodal data to reflect the heterogeneity and interconnections between individual modality elements, before (2) alignment captures the connections and interactions between multiple local elements according to their structure. After representation and alignment comes (3) reasoning, which aims to combine the information from multimodal evidence in a principled way that respects the structure of the problem to give more robust and interpretable predictions. While most systems aim to predict the label $y$, there are also cases where the goal is (4) generation, to learn a generative process to produce raw modalities that reflect cross-modal interactions, structure, and coherence, or (5) transference, to transfer information from high-resource modalities to low-resource ones and their representations. Finally, (6) quantification revisits the previous challenges to give deeper empirical and theoretical understanding of modality heterogeneity, interconnections, and the learning process.
Figure 2.2.1: The information present in different modalities will often show diverse qualities, structures, and representations. Dimensions of heterogeneity can be measured via differences in individual elements and their distribution, the structure of elements, as well as modality information, noise, and task relevance.
Figure 2.2.2: Modality connections describe how modalities are related and share commonalities, such as correspondences between the same concept in language and images or dependencies across spatial and temporal dimensions. Connections can be studied through both statistical and semantic perspectives.
...and 58 more figures

Theorems & Definitions (15)

Definition 1
Definition 2
Theorem 1
Theorem 2
proof
Definition 3
Definition 4
Theorem 3
Definition 5
Theorem 4
...and 5 more
