DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition
HongYu Liu, Junxin Li, Changxi Guo, Hao Chen, Yaqian Huang, Yifu Guo, Huan Yang, Lihua Cai
TL;DR
DialogGraph-LLM addresses end-to-end audio dialogue intent recognition under limited labeled data. Its core component, MR-DAN (Multi-Relational Dialogue Attention Network), is a graph-based attention network that explicitly models temporal, speaker, and semantic relationships among utterances. The framework feeds the graph-derived representation, together with raw acoustic features, into a multimodal LLM via structured prompting, so the model can reason jointly over dialogue structure and paralinguistic cues. An adaptive semi-supervised learning (SSL) strategy uses LLM-generated pseudo-labels filtered by EMA-based global and class-specific thresholds, and adds a Delta-Margin filter and class-balanced Top-K selection to handle class imbalance and label noise. On MarketCalls and MIntRec2.0, the framework shows substantial gains over strong baselines, including state-of-the-art performance on MIntRec2.0 and notable improvements in real-world MarketCalls scenarios. These results underscore the practical value of explicit dialogue structure and adaptive use of unlabeled data in audio-rich domains. The work offers a scalable pathway for multimodal dialogue understanding that combines graph reasoning with LLMs. Its acknowledged limitations, a fixed graph schema and potential noise in SSL pseudo-labels, motivate future work on alternative backbones and more robust SSL strategies.
Abstract
Recognizing speaker intent in long, multi-speaker audio dialogues has a wide range of applications, but it is a non-trivial AI task because of complex inter-dependencies among speaker utterances and the scarcity of annotated data. To address these challenges, this work proposes DialogGraph-LLM, an end-to-end framework that combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy uses the LLM to generate pseudo-labels through a confidence-aware mechanism: dual-threshold filtering based on both global and class-specific confidence, followed by an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec2.0 benchmark demonstrate DialogGraph-LLM's superiority over strong audio-driven and text-driven baselines. The framework delivers strong performance and efficiency in intent recognition on real-world audio dialogues, demonstrating its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.
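
To make the adaptive pseudo-labeling idea more concrete, the following is a minimal Python sketch of EMA-tracked global and class-specific thresholds combined with a margin filter and class-balanced Top-K selection. It is an illustration under stated assumptions, not the authors' implementation: the names `AdaptiveThresholds` and `select_pseudo_labels`, the momentum and initial threshold values, the rule that a sample must pass both thresholds, and the use of mean top-1 confidence as the EMA target are all hypothetical choices made for this example. The entropy-based selection step is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveThresholds:
    """EMA-tracked global and per-class confidence thresholds (illustrative only)."""
    num_classes: int
    momentum: float = 0.9   # assumed EMA decay, not taken from the paper
    init_tau: float = 0.5   # assumed starting threshold, not taken from the paper
    global_tau: float = field(init=False)
    class_tau: list = field(init=False)

    def __post_init__(self):
        self.global_tau = self.init_tau
        self.class_tau = [self.init_tau] * self.num_classes

    def update(self, probs):
        """Move each threshold toward the batch's mean top-1 confidence."""
        tops = [(max(p), p.index(max(p))) for p in probs]
        if not tops:
            return
        m = self.momentum
        batch_mean = sum(c for c, _ in tops) / len(tops)
        self.global_tau = m * self.global_tau + (1 - m) * batch_mean
        for k in range(self.num_classes):
            confs = [c for c, y in tops if y == k]
            if confs:  # classes absent from the batch keep their old threshold
                class_mean = sum(confs) / len(confs)
                self.class_tau[k] = m * self.class_tau[k] + (1 - m) * class_mean


def select_pseudo_labels(probs, thresholds, delta=0.1, top_k=2):
    """Return {class: [sample indices]} for confident, well-separated predictions.

    Assumes every probability vector has at least two classes.
    """
    per_class = {}
    for i, p in enumerate(probs):
        ranked = sorted(p, reverse=True)
        top1, top2 = ranked[0], ranked[1]
        label = p.index(top1)
        # Dual-threshold check: assumed here to require passing BOTH thresholds.
        passes_dual = (top1 >= thresholds.global_tau
                       and top1 >= thresholds.class_tau[label])
        # Margin check: top-1 must beat top-2 by at least delta.
        passes_margin = (top1 - top2) >= delta
        if passes_dual and passes_margin:
            per_class.setdefault(label, []).append((top1, i))
    # Class-balanced Top-K: keep at most top_k most confident samples per class.
    return {k: [i for _, i in sorted(v, reverse=True)[:top_k]]
            for k, v in per_class.items()}


# Example: five unlabeled samples scored over three intent classes.
probs = [
    [0.90, 0.05, 0.05],
    [0.50, 0.45, 0.05],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],
    [0.95, 0.03, 0.02],
]
thresholds = AdaptiveThresholds(num_classes=3)
thresholds.update(probs)
selected = select_pseudo_labels(probs, thresholds, delta=0.1, top_k=2)
print(selected)  # {0: [4, 0], 1: [2]}
```

In this toy run, sample 1 is rejected because its top-1 confidence falls below the updated thresholds, and class 0 is capped at its two most confident samples so that a frequent class cannot dominate the pseudo-labeled set.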
