Multimodal Classification and Out-of-distribution Detection for Multimodal Intent Understanding

Hanlei Zhang, Qianrui Zhou, Hua Xu, Jianhua Su, Roberto Evans, Kai Gao

TL;DR

This work tackles open-world multimodal intent understanding by jointly addressing ID classification and OOD detection. It introduces MIntOOD, which combines a weighted fusion network for dynamic modality weighting, Dirichlet-based pseudo-OOD data generation, and multi-granularity discriminative learning including a cosine classifier and contrastive objectives. The approach achieves state-of-the-art or competitive ID accuracy and substantially improves OOD detection across three benchmarks, demonstrating strong generalization to unseen OOD data. The method has practical implications for robust multimodal dialogue systems, and the work establishes baselines and an OOD benchmark for future research.

Abstract

Multimodal intent understanding is a significant research area that requires effective leveraging of multiple modalities to analyze human language. Existing methods face two main challenges in this domain. Firstly, they have limitations in capturing the nuanced and high-level semantics underlying complex in-distribution (ID) multimodal intents. Secondly, they exhibit poor generalization when confronted with unseen out-of-distribution (OOD) data in real-world scenarios. To address these issues, we propose a novel method for both ID classification and OOD detection (MIntOOD). We first introduce a weighted feature fusion network that models multimodal representations. This network dynamically learns the importance of each modality, adapting to multimodal contexts. To develop discriminative representations for both tasks, we synthesize pseudo-OOD data from convex combinations of ID data and engage in multimodal representation learning from both coarse-grained and fine-grained perspectives. The coarse-grained perspective focuses on distinguishing between ID and OOD binary classes, while the fine-grained perspective not only enhances the discrimination between different ID classes but also captures instance-level interactions between ID and OOD samples, promoting proximity among similar instances and separation from dissimilar ones. We establish baselines for three multimodal intent datasets and build an OOD benchmark. Extensive experiments on these datasets demonstrate that our method significantly improves OOD detection performance with a 3-10% increase in AUROC scores while achieving new state-of-the-art results in ID classification. Data and code are available at https://github.com/thuiar/MIntOOD.
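
As an illustration of the pseudo-OOD synthesis step, the sketch below mixes a few ID feature vectors into one new vector using convex weights drawn from a Dirichlet distribution. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the choice of `k` mixed samples, the concentration `alpha`, and the rule that mixed samples come from distinct ID classes are all hypothetical choices made for this example.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw k convex weights from a symmetric Dirichlet(alpha).

    Uses the standard construction: normalize k independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def make_pseudo_ood(features, labels, k=2, alpha=2.0, rng=None):
    """Mix k ID feature vectors into one pseudo-OOD vector (hypothetical helper).

    Assumption for this sketch: the k samples are taken from k distinct ID
    classes, so the mixture is less likely to sit inside a single class.
    """
    rng = rng or random.Random()
    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    if len(by_class) < k:
        raise ValueError("need at least k distinct ID classes")
    # Pick k distinct classes, then one sample from each.
    chosen_classes = rng.sample(sorted(by_class), k)
    chosen = [rng.choice(by_class[c]) for c in chosen_classes]
    # Convex combination: weights are positive and sum to 1.
    weights = sample_dirichlet(alpha, k, rng)
    dim = len(features[0])
    mixed = [
        sum(w * features[i][d] for w, i in zip(weights, chosen))
        for d in range(dim)
    ]
    return mixed, weights, chosen

if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    labs = [0, 1, 2]
    vec, w, picked = make_pseudo_ood(feats, labs, k=2, rng=random.Random(0))
    print("mixed feature:", vec)
    print("weights:", w, "from samples:", picked)
```

Because the weights are non-negative and sum to one, every synthesized vector lies inside the convex hull of the selected ID features. In a training loop, such vectors would be labeled as OOD for the coarse-grained binary objective.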

Paper Structure

This paper contains 25 sections, 13 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Examples of in-distribution and out-of-distribution samples for multimodal intent understanding.
  • Figure 2: Overall architecture of MIntOOD. It begins by generating pseudo-OOD samples through convex combinations of features extracted from ID samples. A weighted feature fusion network is then designed to dynamically learn an importance score for each modality, serving as weights during feature fusion. To learn robust representations for both ID classification and OOD detection, MIntOOD focuses on three granularities: (a) coarse-grained binary information, learned via a binary classifier trained to distinguish ID from OOD samples, (b) fine-grained information for distinguishing ID classes, captured by using a cosine classifier to analyze the angular information, and (c) fine-grained information for distinguishing instance-level differences, achieved through contrastive learning that captures similarity relations among ID and OOD utterances.
  • Figure 3: A comparison between different OOD detection methods.
  • Figure 4: Confusion matrices for ID classes across the three datasets.
  • Figure 5: Distribution of OOD detection scores for ID and OOD data in the testing sets of the three datasets.
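
Two components named in the Figure 2 caption, the weighted feature fusion and the cosine classifier, can be sketched in a few lines. This is a rough illustration under assumptions, not the paper's architecture: the linear scorer, the softmax normalization of importance scores, the `scale` factor on the cosine logits, and all function names are hypothetical choices for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_fusion(modality_feats, scorer_weights):
    """Fuse per-modality vectors with learned importance weights.

    Assumption: each modality gets one scalar score from a linear scorer
    (dot product with its own scorer vector), and scores are normalized
    with a softmax so the fusion weights sum to 1.
    """
    scores = [
        sum(f * w for f, w in zip(feat, sw))
        for feat, sw in zip(modality_feats, scorer_weights)
    ]
    weights = softmax(scores)
    dim = len(modality_feats[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, modality_feats))
        for d in range(dim)
    ]
    return fused, weights

def cosine_logits(feature, class_weights, scale=10.0):
    """Logits based only on the angle between a feature and each class vector."""
    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0
    f_norm = norm(feature)
    return [
        scale * sum(a * b for a, b in zip(feature, w)) / (f_norm * norm(w))
        for w in class_weights
    ]

if __name__ == "__main__":
    text, video, audio = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
    scorers = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    fused, weights = weighted_fusion([text, video, audio], scorers)
    print("modality weights:", weights)
    print("class logits:", cosine_logits(fused, [[1.0, 0.0], [0.0, 1.0]]))
```

Since the cosine logits depend only on direction, rescaling a feature vector leaves the class scores unchanged, which matches the caption's emphasis on angular information.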