Multimodal Classification and Out-of-distribution Detection for Multimodal Intent Understanding
Hanlei Zhang, Qianrui Zhou, Hua Xu, Jianhua Su, Roberto Evans, Kai Gao
TL;DR
This work tackles open-world multimodal intent understanding by jointly addressing in-distribution (ID) classification and out-of-distribution (OOD) detection. It introduces MIntOOD, which combines a weighted fusion network for dynamic modality weighting, Dirichlet-based pseudo-OOD data generation, and multi-granularity discriminative learning that includes a cosine classifier and contrastive objectives. The approach achieves state-of-the-art or competitive ID accuracy and substantially improves OOD detection across three benchmarks, demonstrating strong generalization to unseen OOD data. The method has practical implications for robust multimodal dialogue systems, and the work establishes baselines and an OOD benchmark for future research.
Abstract
Multimodal intent understanding is a significant research area that requires effectively leveraging multiple modalities to analyze human language. Existing methods face two main challenges in this domain. First, they are limited in capturing the nuanced, high-level semantics underlying complex in-distribution (ID) multimodal intents. Second, they generalize poorly when confronted with unseen out-of-distribution (OOD) data in real-world scenarios. To address these issues, we propose MIntOOD, a novel method for both ID classification and OOD detection. We first introduce a weighted feature fusion network that models multimodal representations. This network dynamically learns the importance of each modality, adapting to multimodal contexts. To develop discriminative representations for both tasks, we synthesize pseudo-OOD data from convex combinations of ID data and perform multimodal representation learning from both coarse-grained and fine-grained perspectives. The coarse-grained perspective focuses on the binary distinction between ID and OOD data. The fine-grained perspective enhances discrimination among different ID classes and also captures instance-level interactions between ID and OOD samples, promoting proximity among similar instances and separation from dissimilar ones. We establish baselines for three multimodal intent datasets and build an OOD benchmark. Extensive experiments on these datasets demonstrate that our method significantly improves OOD detection performance, with a 3-10% increase in AUROC scores, while achieving new state-of-the-art results in ID classification. Data and code are available at https://github.com/thuiar/MIntOOD.
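To make the pseudo-OOD synthesis step concrete, the following is a minimal sketch of building one pseudo-OOD feature vector as a convex combination of ID feature vectors, with mixing weights drawn from a Dirichlet distribution. It is an illustration under stated assumptions, not the implementation in the MIntOOD repository: the function names `sample_dirichlet` and `synthesize_pseudo_ood`, the parameters `k` and `alpha`, the choice to mix samples from distinct classes, and the use of plain Python lists as features are all hypothetical.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw k weights from a symmetric Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def synthesize_pseudo_ood(features, labels, k=2, alpha=1.0, rng=None):
    """Mix k ID feature vectors into one pseudo-OOD vector.

    Assumption (not stated in the abstract): the k source samples are taken
    from k distinct ID classes, so the mixture does not belong to any one class.
    Raises ValueError if fewer than k distinct classes are available.
    """
    rng = rng or random.Random()

    # Pick k distinct classes, then one random sample from each class.
    classes = rng.sample(sorted(set(labels)), k)
    picked = []
    for c in classes:
        candidates = [i for i, y in enumerate(labels) if y == c]
        picked.append(features[rng.choice(candidates)])

    # Non-negative weights that sum to 1 make the result a convex combination.
    weights = sample_dirichlet(alpha, k, rng)

    dim = len(picked[0])
    mixed = [sum(w * x[d] for w, x in zip(weights, picked)) for d in range(dim)]
    return mixed, weights


if __name__ == "__main__":
    # Toy ID features with three classes (hypothetical data).
    feats = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
    labs = [0, 1, 2]
    vec, w = synthesize_pseudo_ood(feats, labs, k=2, rng=random.Random(0))
    print("weights:", w)
    print("pseudo-OOD feature:", vec)
```

Because the weights are non-negative and sum to one, each synthesized vector lies inside the convex hull of the selected ID samples. In a training loop, such vectors would be labeled as OOD for the coarse-grained ID-versus-OOD objective; how the actual method selects the source samples and sets the Dirichlet parameter is not specified in this section.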
