Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning

Xuxin Cheng, Wanshi Xu, Zhihong Zhu, Hongxiang Li, Yuexian Zou

TL;DR

This paper tackles multi-intent spoken language understanding by introducing MMCL, a single-stage framework that employs margin-based, multi-level, multi-grained contrastive learning across utterance, slot, and word levels to enable mutual guidance between intent detection and slot filling. It combines a self-attentive encoder, a contrastive learning layer, a token-level multi-label intent decoder, a global graph interaction layer, and a self-distillation module, with a joint loss that integrates all signals. Empirical results on MixATIS and MixSNIPS demonstrate state-of-the-art performance, and ablations confirm the contribution of each contrastive level and the robustness benefits of self-distillation, with additional gains when using pre-trained language models. The work advances SLU by reducing error propagation typical of multi-stage systems and providing a principled framework for leveraging intra-sentence structure to improve both intent and slot predictions, with practical impact for robust, multi-intent dialogue systems.

Abstract

Spoken language understanding (SLU) is a core task in task-oriented dialogue systems that aims to understand the user's current goal by constructing semantic frames. SLU usually consists of two subtasks: intent detection and slot filling. Although some SLU frameworks jointly model the two subtasks and achieve high performance, most still overlook the inherent relationships between intents and slots and fail to achieve mutual guidance between the two subtasks. To solve this problem, we propose MMCL, a multi-level multi-grained SLU framework that applies contrastive learning at three levels (utterance, slot, and word) so that intents and slots can guide each other. At the utterance level, our framework performs coarse-grained and fine-grained contrastive learning simultaneously. We also apply self-distillation to improve the robustness of the model. Experimental results and further analysis demonstrate that our proposed model achieves new state-of-the-art results on two public multi-intent SLU datasets, obtaining a 2.6 overall accuracy improvement on the MixATIS dataset over the previous best models.
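
The idea of margin-based contrastive learning mentioned above can be illustrated with a small sketch. This is a generic triplet-style margin loss over cosine distance, not the paper's exact formulation, which is not given in this overview. The function names, the margin value of 0.5, and the toy 2-D vectors are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def margin_contrastive_loss(anchor, positive, negative, margin=0.5):
    """Generic triplet margin loss (hypothetical helper, not from the paper).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is; otherwise it grows linearly.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)


# Toy 2-D "utterance" representations, made up for illustration.
anchor = [1.0, 0.0]
similar = [0.9, 0.1]    # semantically close to the anchor
unrelated = [0.0, 1.0]  # orthogonal to the anchor

# Correct arrangement: the similar vector is the positive.
well_separated = margin_contrastive_loss(anchor, similar, unrelated)

# Swapped arrangement: the unrelated vector is treated as the positive.
confused = margin_contrastive_loss(anchor, unrelated, similar)

print(well_separated, confused)
```

In this sketch the loss vanishes when similar representations already sit closer to the anchor than unrelated ones by at least the margin, and it is positive otherwise, which is the behavior a margin-based objective is meant to encourage.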

Paper Structure (16 sections, 17 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: An example of intent detection and slot filling.
  • Figure 2: Illustration of utterance-level text representations (projected to 2D). Circles of the same color have similar semantics. (a): representations learned by existing models. (b): the ideal representations we expect, where semantically similar pairs stay close to each other and semantically unrelated pairs stay far apart.
  • Figure 3: The main architecture of MMCL. We introduce margin-based multi-level multi-grained contrastive learning to explore the inherent relationships and achieve mutual guidance between intent and slot in SLU. A self-distillation method is also applied to improve the robustness of the model and prevent over-confidence.
  • Figure 4: The illustration of margin-based multi-level multi-grained contrastive learning examples. For utterance level, the model implements coarse granularity and fine granularity contrastive learning simultaneously.
  • Figure 5: Case study of our model compared to previous models in achieving mutual guidance and avoiding error propagation.
  • ...and 1 more figure