Towards Robust Multimodal Learning in the Open World

Fushuo Huo

TL;DR

The thesis tackles robust open-world multimodal learning by addressing three core issues: compositional generalization across modality primitives, robustness to missing modalities via cross-modal knowledge transfer, and hallucination mitigation in large vision-language models (LVLMs). It introduces ProCC (Progressive Cross-Primitive Compatibility) to learn feasible state-object compositions without external knowledge, C^2KD (Customized Cross-modal Knowledge Distillation) to preserve cross-modal knowledge under modality gaps, and SID (Self-Introspective Decoding) to suppress language priors in LVLMs through a training-free decoding strategy. Extensive experiments across OW-CZSL/pCZSL benchmarks, audio-visual, image-text, and RGB-depth tasks demonstrate state-of-the-art performance and robust gains, with ablations validating each component’s contribution. The work also shows that SID reduces hallucinations and inference cost while preserving general capabilities, indicating practical impact for trustworthy open-world multimodal AI. Collectively, these approaches advance reliable multimodal reasoning, inference under partial inputs, and language-grounded generation in real-world AI systems.

Abstract

The rapid evolution of machine learning has propelled neural networks to unprecedented success across diverse domains. In particular, multimodal learning has emerged as a transformative paradigm, leveraging complementary information from heterogeneous data streams (e.g., text, vision, audio) to advance contextual reasoning and intelligent decision-making. Despite these advancements, current neural network-based models often fall short in open-world environments, where unpredictable environmental composition dynamics, incomplete modality inputs, and spurious distributional relations critically undermine system reliability. While humans naturally adapt to such dynamic, ambiguous scenarios, artificial intelligence systems exhibit stark limitations in robustness, particularly when processing multimodal signals under real-world complexity. This study investigates the fundamental challenge of multimodal learning robustness in open-world settings, aiming to bridge the gap between controlled experimental performance and practical deployment requirements.

Paper Structure

This paper contains 63 sections, 21 equations, 35 figures, 23 tables, 1 algorithm.

Figures (35)

  • Figure 1: Research framework of this thesis. We organize the positioning of this thesis within the field of robust multimodal learning in the open world. We classify the challenges into class-level and modality-level robustness and illustrate the contributions we focus on for each chapter.
  • Figure 2: Intuitive presentation of cross-modal knowledge distillation.
  • Figure 3: The overall concept of our method. Following the 'forest before trees' principle, the human feedforward hierarchy underlies implicit processing for initial vision at a glance (i.e., green rectangle), while feedback connections add details to explicit vision with scrutiny (i.e., red rectangle). For compositional generalization learning, humans first ($\textcircled{\scriptsize{I}}$) learn to recognize overall objects, then ($\textcircled{\scriptsize{II}}$) gradually identify the attributes of objects that require scrutiny, i.e., states, and finally ($\textcircled{\scriptsize{III}}$) reasonably compose the object and state primitives. Inspired by this, we aim to progressively recognize the object and state primitives and guide the network to exploit discriminative information conditioned on learned knowledge via the CPC module.
  • Figure 4: The framework of ProCC. Features from the encoder ($\omega$) are fed to the object and state classifiers ($\varphi_o$ and $\varphi_s$, respectively), where the Cross-Primitive Compatibility (CPC) module models the cross-primitive interactions. A progressive learning strategy is proposed to gradually modulate primitive compatibility, especially for pCZSL. For the detailed training procedure, please refer to Algorithm 2. Class Activation Maps (CAM) of input samples are shown to illustrate visual attention.
  • Figure 5: The detailed framework of the object-state Cross-Primitive Compatibility (CPC$_{o \rightarrow s}$). Features from the object classifier ($\varphi_{o-1}$ and $\varphi_{o-2}$) are encoded by learnable Cross-Primitive Memory (CPM) units and then interact with the corresponding state features ($\varphi_{s-1}$ and $\varphi_{s-2}$, respectively) to achieve compatibility of state features conditioned on objects.
  • ...and 30 more figures