SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-modal Intent Detection
Shijue Huang, Libo Qin, Bingbing Wang, Geng Tu, Ruifeng Xu
TL;DR
SDIF-DA addresses tri-modal (text, video, audio) intent detection under limited labeled data by combining a shallow-to-deep interaction framework with a ChatGPT-based data augmentation pipeline. The shallow interaction module progressively aligns the non-text modalities to text via cross-attention, producing intermediate representations; a deep interaction stage then fuses these representations with a Transformer to produce the final prediction. Augmenting the training set with $25{,}000$ ChatGPT-generated utterances and training with an assist-learning loss $\mathcal{L}_{Aug}$ distills knowledge from the large language model, improving performance, especially in low-resource settings and on fine-grained intents. Empirical results on the MIntRec benchmark show state-of-the-art performance and a measurable contribution from each component, indicating that structured cross-modal alignment and LLM-based data augmentation are complementary for multimodal dialogue systems.
Abstract
Multi-modal intent detection aims to utilize various modalities to understand the user's intentions, which is essential for the deployment of dialogue systems in real-world scenarios. The two core challenges for multi-modal intent detection are (1) how to effectively align and fuse the features of different modalities and (2) the limited amount of labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address these challenges. First, SDIF-DA leverages a shallow-to-deep interaction module to progressively align and fuse features across the text, video, and audio modalities. Second, we propose a ChatGPT-based data augmentation approach to automatically generate additional training data. Experimental results show that SDIF-DA effectively aligns and fuses multi-modal features, achieving state-of-the-art performance. In addition, extensive analyses show that the introduced data augmentation approach successfully distills knowledge from the large language model.
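To make the shallow-to-deep idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that text tokens act as attention queries over video and then audio features, that each alignment step is added back residually, and that the deep stage receives a simple concatenation of all sequences. The function names (`cross_attention`, `shallow_interaction`, `deep_interaction_input`) are hypothetical, learned projection matrices are omitted, and all modalities are assumed to share one feature dimension.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, memory):
    # Scaled dot-product attention without learned projections (a simplification).
    # Each query row becomes a weighted average of the memory rows.
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        attended.append([sum(w * m[j] for w, m in zip(weights, memory))
                         for j in range(dim)])
    return attended

def add(a, b):
    # Element-wise sum of two equally shaped sequences (residual connection).
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def shallow_interaction(text, video, audio):
    # Assumed ordering: align video to text first, then audio to the result.
    text_video = add(text, cross_attention(text, video))
    text_video_audio = add(text_video, cross_attention(text_video, audio))
    return text_video, text_video_audio

def deep_interaction_input(text, video, audio):
    # Assumed fusion input: raw and aligned sequences concatenated along the
    # sequence axis. A Transformer encoder would consume this; it is not shown.
    text_video, text_video_audio = shallow_interaction(text, video, audio)
    return text + video + audio + text_video + text_video_audio

# Toy features: 2 text tokens, 3 video frames, 1 audio frame, dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
audio = [[2.0, 3.0]]
sequence = deep_interaction_input(text, video, audio)
print(len(sequence), "rows of dimension", len(sequence[0]))
```

Using text as the query keeps every aligned representation at the length of the text sequence, which is one plausible reading of "aligning non-text modalities to text". A practical version would use learned multi-head attention and a trained Transformer encoder.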
