Towards Robust Multimodal Learning in the Open World
Fushuo Huo
TL;DR
The thesis tackles robust open-world multimodal learning by addressing three core issues: compositional generalization across modality primitives, robustness to missing modalities via cross-modal knowledge transfer, and hallucination mitigation in multimodal LLMs. It introduces ProCC (Progressive Cross-Primitive Compatibility) to learn feasible state-object compositions without external knowledge, C^2KD (Customized Cross-modal Knowledge Distillation) to preserve cross-modal knowledge under modality gaps, and SID (Self-Introspective Decoding), a training-free decoding strategy that suppresses language priors in LVLMs. Experiments on OW-CZSL/pCZSL benchmarks and on audio-visual, image-text, and RGB-depth tasks demonstrate state-of-the-art performance, with ablations validating each component’s contribution. The work also shows that SID reduces hallucinations and inference cost while preserving general capabilities, indicating practical impact for trustworthy open-world multimodal AI. Collectively, these approaches advance reliable multimodal reasoning, inference under partial inputs, and language-grounded generation in real-world AI systems.
Abstract
The rapid evolution of machine learning has propelled neural networks to unprecedented success across diverse domains. In particular, multimodal learning has emerged as a transformative paradigm, leveraging complementary information from heterogeneous data streams (e.g., text, vision, audio) to advance contextual reasoning and intelligent decision-making. Despite these advances, current neural network-based models often fall short in open-world environments, where unpredictable compositional dynamics of the environment, incomplete modality inputs, and spurious distributional relations critically undermine system reliability. While humans naturally adapt to such dynamic, ambiguous scenarios, artificial intelligence systems exhibit stark limitations in robustness, particularly when processing multimodal signals under real-world complexity. This study investigates the fundamental challenge of multimodal learning robustness in open-world settings, aiming to bridge the gap between controlled experimental performance and practical deployment requirements.
