Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection

Ankan Mullick, Saransh Sharma, Abhik Jana, Pawan Goyal

TL;DR

This study interrogates modality bias in multimodal intent detection using MIntRec-1 and MIntRec2.0, revealing a strong textual bias where text alone suffices for many samples and where text dominates even in multimodal tasks. It compares textual LLMs, non-LLM multimodal baselines, and multimodal LLMs, finding textual models like Mistral-7B often outperform multimodal systems on biased data. The authors propose a debiasing framework that removes text-biased samples, showing substantial performance drops across models and exposing the limited value of modality fusion in biased benchmarks. The work calls for unbiased datasets and adaptive, input-specific fusion mechanisms to realize robust multimodal intent detection in real-world settings.

Abstract

The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multimodal models, on the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0. This advantage stems from a strong textual bias in these datasets: over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We also confirm this modality bias through human evaluation. Next, we propose a framework to debias the datasets. Debiasing removes more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0, and leads to significant performance degradation across all models; smaller multimodal fusion models are the most affected, with accuracy drops of over 50-60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
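
The two quantities mentioned in the abstract (the share of samples that depend on text, and the subset left after removing text-biased samples) can be illustrated with a minimal sketch. This is not the authors' framework: the record format, the function names, and the reading of "requires textual input" as "every correct modality combination includes text" are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List

# Hypothetical record format (an assumption, not from the paper):
# sample id -> list of modality combinations whose model predicted correctly.
Solved = Dict[str, List[FrozenSet[str]]]


def requires_text(combos: List[FrozenSet[str]]) -> bool:
    """Assumed criterion: at least one combination is correct,
    and every correct combination includes the text modality."""
    return bool(combos) and all("text" in c for c in combos)


def text_dependent_fraction(solved: Solved) -> float:
    """Fraction of samples that meet the assumed text-dependence criterion."""
    if not solved:
        return 0.0
    return sum(requires_text(c) for c in solved.values()) / len(solved)


def drop_text_solvable(solved: Solved) -> List[str]:
    """Keep only samples that a text-only model does NOT classify correctly
    (an illustrative stand-in for removing text-biased samples)."""
    text_only = frozenset({"text"})
    return [sid for sid, combos in solved.items() if text_only not in combos]


# Toy data, invented purely for illustration.
toy: Solved = {
    "a": [frozenset({"text"}), frozenset({"text", "audio"})],
    "b": [frozenset({"text", "video"})],
    "c": [frozenset({"audio"})],
    "d": [],
}

fraction = text_dependent_fraction(toy)
kept = drop_text_solvable(toy)
```

On the toy data, samples `a` and `b` count as text-dependent, and only `a` is dropped because text alone already solves it. The actual criteria used in the paper may differ.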

Paper Structure

This paper contains 50 sections, 17 figures, 35 tables.

Figures (17)

  • Figure 1: Role of different modalities in Multimodal Intent Detection Task
  • Figure 2: Multimodal model confusion caused by the provided image frames
  • Figure 3: Textual LLM prompt for finetuning
  • Figure 4: Inference prompt (for models used without training, few-shot)
  • Figure 5: WordCloud of Agree, Apologize, and Thank labels
  • ...and 12 more figures