SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-modal Intent Detection

Shijue Huang, Libo Qin, Bingbing Wang, Geng Tu, Ruifeng Xu

TL;DR

SDIF-DA tackles tri-modal (text, video, audio) intent detection under limited labeled data by combining a shallow-to-deep interaction framework with a ChatGPT-based data augmentation pipeline. The shallow interaction stage progressively aligns the non-text modalities to text via cross-attention; the resulting intermediate representations are then fused by a Transformer in a deep interaction stage to produce the intent prediction. Augmenting the training set with 25,000 ChatGPT-generated utterances and adding an assist-learning loss $\mathcal{L}_{Aug}$ distills knowledge from the large language model and improves performance, especially in low-resource settings and on fine-grained intents. On the MIntRec benchmark, SDIF-DA achieves state-of-the-art performance, and each component yields a clear benefit, showing the practical value of combining structured cross-modal alignment with LLM-based data augmentation for multimodal dialogue systems.
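
The two-stage flow described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the single-head attention without learned projections, the residual connections, the order in which video and audio are aligned, the concatenation used in the deep stage, and all names and sizes (`shallow_interaction`, `deep_interaction`, `dim`, `num_intents`) are assumptions made only to show the shape of the computation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, context):
    # Single-head scaled dot-product attention with no learned projections
    # (an assumption; a real model would use learned Q/K/V weights).
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d))  # (len_q, len_ctx)
    return weights @ context, weights

def shallow_interaction(text, video, audio):
    # Shallow stage (assumed ordering): text tokens first attend to video
    # frames, then the text-video representation attends to audio frames.
    # Residual additions keep text as the anchor modality.
    text_video = text + attention(text, video)[0]
    return text_video + attention(text_video, audio)[0]

def deep_interaction(aligned, text, video, audio, w_out):
    # Deep stage (assumed): concatenate all sequences and apply one
    # self-attention pass as a stand-in for a full Transformer encoder.
    seq = np.concatenate([aligned, text, video, audio], axis=0)
    fused, _ = attention(seq, seq)
    pooled = fused.mean(axis=0)   # mean-pool over the fused sequence
    return pooled @ w_out         # intent logits

# Hypothetical sizes: 8 text tokens, 12 video frames, 10 audio frames.
rng = np.random.default_rng(0)
dim, num_intents = 16, 5
text = rng.normal(size=(8, dim))
video = rng.normal(size=(12, dim))
audio = rng.normal(size=(10, dim))
w_out = rng.normal(size=(dim, num_intents))

aligned = shallow_interaction(text, video, audio)
logits = deep_interaction(aligned, text, video, audio, w_out)
probs = softmax(logits)
```

Here `aligned` keeps the text sequence length, since text supplies the queries in both shallow steps, while the deep stage sees every modality at once and reduces them to a single vector of intent scores.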

Abstract

Multi-modal intent detection aims to use multiple modalities to understand the user's intentions, which is essential for deploying dialogue systems in real-world scenarios. The two core challenges for multi-modal intent detection are (1) how to effectively align and fuse the features of different modalities and (2) the limited amount of labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address these challenges. First, SDIF-DA leverages a shallow-to-deep interaction module to progressively align and fuse features across the text, video, and audio modalities. Second, we propose a ChatGPT-based data augmentation approach that automatically generates additional training data. Experimental results show that SDIF-DA effectively aligns and fuses multi-modal features, achieving state-of-the-art performance. In addition, extensive analyses show that the introduced data augmentation approach can successfully distill knowledge from the large language model.

Paper Structure

This paper contains 10 sections, 7 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: An example of multi-modal intent detection.
  • Figure 2: (a) Overall architecture of the shallow-to-deep interaction framework (SDIF). It contains a hierarchical module that performs alignment through shallow interaction, and a Transformer module that aggregates and fuses all information through deep interaction; (b) Workflow of the ChatGPT-based data augmentation approach.
  • Figure 3: Low-resource performance.
  • Figure 4: Fine-grained analysis of two hard intent taxonomies. SDIF denotes our framework without data augmentation.