Towards Real-world Scenario: Imbalanced New Intent Discovery

Shun Zhang, Chaoran Yan, Jian Yang, Jiaheng Liu, Ying Mo, Jiaqi Bai, Tongliang Li, Zhoujun Li

TL;DR

Imbalanced New Intent Discovery (i-NID) is the task of identifying known intents and clustering novel intents under long-tailed distributions in open-world dialogue. The authors propose ImbaNID, a three-stage framework that combines model pre-training, reliable pseudo-labeling via Relaxed Optimal Transport (ROT) with KL-based distribution constraints, and robust representation learning using distribution-aware and quality-aware regularization plus class-wise and instance-wise contrastive clustering. A new benchmark, ImbaNID-Bench (CLINC150-LT, BANKING77-LT, StackOverflow20-LT), simulates real-world imbalance; on it, ImbaNID reports state-of-the-art performance, particularly for tail classes. Overall, the work provides a principled, scalable baseline for i-NID, enabling more effective discovery and categorization of user intents in highly imbalanced data settings.

Abstract

New Intent Discovery (NID) aims to detect known and previously undefined categories of user intent using limited labeled data and massive unlabeled data. Most prior work operates under the unrealistic assumption that both familiar and new intent classes are uniformly distributed, overlooking the skewed, long-tailed distributions frequently encountered in real-world scenarios. To bridge this gap, our work introduces the imbalanced new intent discovery (i-NID) task, which seeks to identify familiar and novel intent categories within long-tailed distributions. A new benchmark (ImbaNID-Bench) comprising three datasets is created to simulate real-world long-tail distributions. ImbaNID-Bench ranges from broad cross-domain to specific single-domain intent categories, providing a thorough representation of practical use cases. In addition, a robust baseline model, ImbaNID, is proposed to achieve cluster-friendly intent representations. It includes three stages: model pre-training, generation of reliable pseudo-labels, and robust representation learning that strengthens the model's ability to handle the intricacies of real-world data distributions. Our extensive experiments on previous benchmarks and the newly established benchmark demonstrate the superior performance of ImbaNID on the i-NID task, highlighting its potential as a strong baseline for uncovering and categorizing user intents in imbalanced and long-tailed distributions. Code: https://github.com/Zkdc/i-NID
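
The pseudo-labeling stage can be illustrated with a generic relaxed optimal transport sketch. This is not the authors' implementation: it is the standard entropic Sinkhorn scaling with a KL-relaxed class marginal, and the function name `relaxed_sinkhorn`, the uniform class prior, and the parameters `eps` (entropic weight) and `lam` (KL penalty weight) are all assumptions made for illustration. Each sample keeps unit mass, while class sizes are only softly pulled toward the prior, so head classes may keep more samples than tail classes.

```python
def relaxed_sinkhorn(probs, eps=0.1, lam=1.0, n_iters=50):
    """Turn N x K predicted probabilities into N x K soft pseudo-labels.

    Hypothetical sketch: row marginals are enforced exactly (each sample
    carries mass 1/N); the column marginal is relaxed with a KL penalty
    toward a uniform class prior (an assumption, not the paper's choice).
    """
    n, k = len(probs), len(probs[0])
    # Gibbs kernel exp(-C / eps) with cost C = -log p, i.e. p ** (1 / eps).
    kernel = [[max(p, 1e-12) ** (1.0 / eps) for p in row] for row in probs]
    a = 1.0 / n              # mass of each sample (hard constraint)
    b = 1.0 / k              # prior mass of each class (soft constraint)
    rho = lam / (lam + eps)  # damping exponent from the KL relaxation
    v = [1.0] * k

    def row_scale(v_cur):
        # Scale rows so that every sample carries exactly mass a.
        return [a / sum(kernel[i][j] * v_cur[j] for j in range(k))
                for i in range(n)]

    for _ in range(n_iters):
        u = row_scale(v)
        col = [sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
        # Damped column update: rho < 1 lets class sizes deviate from b.
        v = [(b / col[j]) ** rho for j in range(k)]

    u = row_scale(v)  # finish on a row update so each row sums to a
    # Multiply by n so each row is a distribution over classes.
    return [[n * u[i] * kernel[i][j] * v[j] for j in range(k)]
            for i in range(n)]


# Toy batch: three samples lean toward class 0, one toward class 1.
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
soft = relaxed_sinkhorn(probs)
labels = [max(range(len(row)), key=row.__getitem__) for row in soft]
```

A larger `lam` pushes the exponent `rho` toward 1 and recovers balanced Sinkhorn, which forces near-uniform class sizes; a smaller `lam` leaves the assignment closer to the model's own predictions.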

Paper Structure

This paper contains 39 sections, 25 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of proposed i-NID task: (a) i-NID unifies open-world and long-tail learning paradigms; (b) i-NID uses labeled and unlabeled data following a long-tail distribution to identify and categorize user intents.
  • Figure 2: Number of training samples per class in artificially created long-tailed CLINC150-LT datasets with different imbalance factors.
  • Figure 3: Overview of ImbaNID. The relaxed optimal transport (ROT) technique is used to produce high-quality pseudo-labels. Distribution-aware regularization (DR) and quality-aware regularization (QR) aim to select clean pseudo-labels. Finally, the framework incorporates class-wise contrastive learning (CWCL) and instance-wise contrastive learning (IWCL) to embed the data into a representation space where similar samples cluster together.
  • Figure 4: Head, Medium, and Tail comparison on the ImbaNID-Bench datasets.
  • Figure 5: t-SNE visualization of embeddings on the StackOverflow20-LT dataset. The known class ratio $|\mathcal{Y}^{k}|/|\mathcal{Y}^{k} \cup \mathcal{Y}^{n}|$ is 0.75, and the labeled ratio is 0.1.
  • ...and 4 more figures
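
The long-tailed class sizes plotted in Figure 2 can be approximated with a short sketch. The exact sampling rule is not given in this excerpt, so the code assumes the exponential decay profile commonly used to build long-tailed benchmarks, where the imbalance factor is the ratio between the largest and smallest class. The function name `long_tail_counts` and the example values (100 samples in the head class, 150 classes, factor 10) are hypothetical.

```python
def long_tail_counts(n_max, num_classes, imbalance_factor):
    """Per-class sample counts decaying from n_max to n_max / imbalance_factor.

    Assumed profile: n_i = n_max / imbalance_factor ** (i / (K - 1)),
    with classes sorted from head (i = 0) to tail (i = K - 1).
    """
    if num_classes < 1 or imbalance_factor < 1:
        raise ValueError("need num_classes >= 1 and imbalance_factor >= 1")
    if num_classes == 1:
        return [n_max]
    counts = []
    for i in range(num_classes):
        frac = i / (num_classes - 1)  # 0.0 for the head, 1.0 for the tail
        counts.append(max(1, round(n_max / imbalance_factor ** frac)))
    return counts


# Hypothetical setting: 150 classes, head class of 100 samples, factor 10.
counts = long_tail_counts(100, 150, 10)
```

An imbalance factor of 1 reproduces the balanced setting, and larger factors shrink the tail classes while leaving the head class untouched.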