A Two-Stage Prediction-Aware Contrastive Learning Framework for Multi-Intent NLU
Guanhua Chen, Yutong Yao, Derek F. Wong, Lidia S. Chao
TL;DR
Multi-intent NLU faces challenges from overlapping intents and data scarcity. The authors propose a two-stage Prediction-Aware Contrastive Learning (PACL) framework that combines word-level pre-training with a prediction-aware contrastive fine-tuning stage, including dynamic role assignment and probability-weighted losses, to better exploit shared-intent information. An intent-slot attention module strengthens the coupling between multi-label intent detection and slot filling, yielding a more discriminative embedding space. On MixATIS, MixSNIPS, and StanfordLU, PACL outperforms strong baselines in both low-data and full-data regimes and accelerates convergence, with ablations confirming the contribution of each component, albeit with higher training cost due to contrastive learning.
Abstract
Multi-intent natural language understanding (NLU) presents a formidable challenge because multiple intents within a single utterance confuse the model. Previous works train the model contrastively to increase the margin between different multi-intent labels, but they are less suited to the nuances of multi-intent NLU: they ignore the rich information in the intents shared between labels, which helps construct a better embedding space, especially in low-data scenarios. We introduce a two-stage Prediction-Aware Contrastive Learning (PACL) framework for multi-intent NLU to harness this knowledge. Our approach capitalizes on shared intent information by integrating word-level pre-training and prediction-aware contrastive fine-tuning. We first construct a pre-training dataset using a word-level data augmentation strategy. During contrastive fine-tuning, our framework then dynamically assigns roles to instances and introduces a prediction-aware contrastive loss to maximize the impact of contrastive learning. Experimental results and empirical analysis on three widely used datasets demonstrate that our method outperforms three prominent baselines in both low-data and full-data scenarios.
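To make the idea of exploiting shared intents concrete, the sketch below shows one possible way a contrastive loss could weight pairs of utterances by how many intents they have in common. It is an illustration only, not the PACL loss: the abstract does not give the formulation, so the function name `shared_intent_contrastive_loss`, the Jaccard-overlap weighting, and the `temperature` default are all assumptions, and the dynamic role assignment and prediction-aware weighting are not reproduced here.

```python
import numpy as np


def shared_intent_contrastive_loss(embeddings, labels, temperature=0.1):
    """Overlap-weighted supervised contrastive loss (illustrative assumption,
    not the loss defined in the paper).

    embeddings: (N, D) utterance representations.
    labels:     (N, K) multi-hot intent labels.
    """
    z = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels, dtype=float)

    # L2-normalise so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    n = z.shape[0]
    sim = z @ z.T / temperature
    not_self = ~np.eye(n, dtype=bool)

    # Pair weight (assumption): Jaccard overlap of the two intent sets,
    # so pairs sharing more intents pull together more strongly.
    inter = y @ y.T
    union = y.sum(axis=1, keepdims=True) + y.sum(axis=1) - inter
    weights = np.where(union > 0, inter / np.maximum(union, 1.0), 0.0) * not_self

    # Log-softmax over all *other* instances in the batch, computed stably.
    sim_masked = np.where(not_self, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(sim_masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    # Average over anchors that have at least one partner with a shared intent.
    weight_sum = weights.sum(axis=1)
    has_pos = weight_sum > 0
    if not has_pos.any():
        return 0.0
    per_anchor = -(weights * log_prob).sum(axis=1)[has_pos] / weight_sum[has_pos]
    return float(per_anchor.mean())


# Hypothetical toy batch: utterances 0 and 1 share one intent, 2 and 3 share all.
toy_labels = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(shared_intent_contrastive_loss(toy_embeddings, toy_labels))
```

Under this weighting, a pair of utterances with partially overlapping intent sets still counts as a (weaker) positive pair, whereas a loss that treats each distinct multi-intent label as its own class would push them apart.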
