MIntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations

Hanlei Zhang, Xin Wang, Hua Xu, Qianrui Zhou, Kai Gao, Jianhua Su, Jinyue Zhao, Wenrui Li, Yanting Chen

TL;DR

MIntRec2.0 introduces a large-scale, multimodal, multi-party benchmark for intent recognition and out-of-scope (OOS) detection in conversations. It comprises 1,245 dialogues and 15,040 utterances, labeled with 30 fine-grained intents plus an OOS label and annotated with speaker identities. A general framework handles both utterance-level and dialogue-level data: it extracts text, video, and audio features, fuses them with MAG-BERT or MulT, trains with cross-entropy (CE) loss on in-scope samples and outlier exposure (OE) loss on OOS samples, and applies a DOC-based open-world inference mechanism. Experiments show that multimodal information improves in-scope accuracy and OOS robustness, with larger gains in single-turn than in multi-turn settings, and that LLMs such as ChatGPT still lag behind humans, who perform well even with modest prior exposure to the data. The work provides a foundational resource and benchmark for robust, context-aware multimodal intent understanding in real-world, open-world conversations, and calls for further advances in context modeling and open-set recognition. The dataset and code are publicly available.

Abstract

Multimodal intent recognition poses significant challenges, requiring the incorporation of non-verbal modalities from real-world contexts to enhance the comprehension of human intentions. Existing benchmark datasets are limited in scale and suffer from difficulties in handling out-of-scope samples that arise in multi-turn conversational interactions. We introduce MIntRec2.0, a large-scale benchmark dataset for multimodal intent recognition in multi-party conversations. It contains 1,245 dialogues with 15,040 samples, each annotated within a new intent taxonomy of 30 fine-grained classes. Besides 9,304 in-scope samples, it also includes 5,736 out-of-scope samples appearing in multi-turn contexts, which naturally occur in real-world scenarios. Furthermore, we provide comprehensive information on the speakers in each utterance, enriching its utility for multi-party conversational research. We establish a general framework supporting the organization of single-turn and multi-turn dialogue data, modality feature extraction, multimodal fusion, as well as in-scope classification and out-of-scope detection. Evaluation benchmarks are built using classic multimodal fusion methods, ChatGPT, and human evaluators. While existing methods incorporating nonverbal information yield improvements, effectively leveraging context information and detecting out-of-scope samples remains a substantial challenge. Notably, large language models exhibit a significant performance gap compared to humans, highlighting the limitations of machine learning methods in the cognitive intent understanding task. We believe that MIntRec2.0 will serve as a valuable resource, providing a pioneering foundation for research in human-machine conversational interactions, and significantly facilitating related applications. The full dataset and codes are available at https://github.com/thuiar/MIntRec2.0.
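The in-scope classification and out-of-scope detection steps mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the usual forms of the three ingredients named in the TL;DR, namely cross-entropy on in-scope samples, an outlier-exposure term that pushes OOS predictions toward the uniform distribution, and a DOC-style rule that rejects a sample when no per-class sigmoid score reaches its threshold. The function names, the `oos_label` value, and the caller-supplied thresholds are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_loss(logits, label):
    """Cross-entropy for one in-scope sample with a known intent label."""
    return -math.log(softmax(logits)[label])


def oe_loss(logits):
    """Outlier-exposure term for one OOS sample (assumed form):
    cross-entropy between the uniform distribution and the softmax output."""
    probs = softmax(logits)
    return -sum(math.log(p) for p in probs) / len(probs)


def predict_open_world(logits, thresholds, oos_label=-1):
    """DOC-style inference (assumed form): reject as OOS when every
    per-class sigmoid score is below its threshold, else take the argmax."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    if all(s < t for s, t in zip(scores, thresholds)):
        return oos_label
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    thresholds = [0.5, 0.5, 0.5]  # hypothetical per-class thresholds
    print(predict_open_world([3.0, -2.0, -1.0], thresholds))
    print(predict_open_world([-3.0, -2.0, -1.0], thresholds))
```

Under these assumptions, the OE term is smallest when the logits are equal, so minimizing it discourages confident in-scope predictions on OOS samples, while the threshold rule turns low confidence into an explicit OOS decision at inference time.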

Paper Structure (23 sections, 1 equation, 9 figures, 18 tables)

This paper contains 23 sections, 1 equation, 9 figures, 18 tables.

Figures (9)

  • Figure 1: An example from the MIntRec2.0 dataset. More examples are provided in the Appendix.
  • Figure 2: In-scope and out-of-scope data distribution.
  • Figure 3: Distribution of in-scope intents in the MIntRec2.0 dataset.
  • Figure 4: Overview of the benchmark framework for the MIntRec2.0 dataset.
  • Figure 5: Samples of the MIntRec2.0 dataset.
  • ...and 4 more figures