Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning
Xuxin Cheng, Wanshi Xu, Zhihong Zhu, Hongxiang Li, Yuexian Zou
TL;DR
This paper tackles multi-intent spoken language understanding by introducing MMCL, a single-stage framework that employs margin-based, multi-level, multi-grained contrastive learning across utterance, slot, and word levels to enable mutual guidance between intent detection and slot filling. It combines a self-attentive encoder, a contrastive learning layer, a token-level multi-label intent decoder, a global graph interaction layer, and a self-distillation module, with a joint loss that integrates all signals. Empirical results on MixATIS and MixSNIPS demonstrate state-of-the-art performance, and ablations confirm the contribution of each contrastive level and the robustness benefits of self-distillation, with additional gains when using pre-trained language models. The work advances SLU by reducing error propagation typical of multi-stage systems and providing a principled framework for leveraging intra-sentence structure to improve both intent and slot predictions, with practical impact for robust, multi-intent dialogue systems.
Abstract
Spoken language understanding (SLU) is a core task in task-oriented dialogue systems; it aims to understand the user's current goal by constructing semantic frames. SLU usually consists of two subtasks: intent detection and slot filling. Although some SLU frameworks model the two subtasks jointly and achieve high performance, most of them still overlook the inherent relationships between intents and slots and fail to achieve mutual guidance between the two subtasks. To solve this problem, we propose MMCL, a multi-level multi-grained SLU framework that applies contrastive learning at three levels (utterance level, slot level, and word level) so that intents and slots can guide each other. At the utterance level, our framework performs coarse-grained and fine-grained contrastive learning simultaneously. We also apply self-distillation to improve the robustness of the model. Experimental results and further analysis demonstrate that our proposed model achieves new state-of-the-art results on two public multi-intent SLU datasets, with an improvement of 2.6 in overall accuracy on the MixATIS dataset over the previous best models.
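The TL;DR describes the contrastive objectives as margin-based, but this page does not give their exact form. As a rough illustration only, the following is a minimal sketch of a generic hinge-style (triplet) margin loss, max(0, margin + d(anchor, positive) - d(anchor, negative)). The function names, the Euclidean distance, the default margin of 1.0, and the toy vectors are all assumptions made for this example, not details taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def margin_contrastive_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet-style margin loss (an assumed form, not the paper's).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, margin + gap)


# Toy 2-d embeddings (hypothetical values).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]

# A negative that is already far from the anchor incurs no loss.
loss_easy = margin_contrastive_loss(anchor, positive, [-1.0, 0.0])

# A negative that sits close to the anchor is penalized.
loss_hard = margin_contrastive_loss(anchor, positive, [0.8, 0.0])
```

In a setting like the one the abstract describes, the anchor, positive, and negative would presumably be utterance-, slot-, or word-level representations, but how those are chosen at each level is not specified here.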
