A Generic Method for Fine-grained Category Discovery in Natural Language Texts

Chang Tian; Matthew B. Blaschko; Wenpeng Yin; Mingzhe Xing; Yinliang Yue; Marie-Francine Moens

TL;DR

Fine-grained category discovery under coarse supervision remains challenging due to underutilized semantic structure. The authors propose STAR, a contrastive learning framework that uses comprehensive semantic similarities measured in a logarithmic space to guide sample distributions and form tight, separable fine-grained clusters, with a centroid-based inference mechanism for real-time tasks. Theoretical analysis links STAR to clustering and generalized EM, and empirical results on CLINC, WOS, and HWU64 show state-of-the-art performance across ACC, ARI, and NMI, with robust ablations and an evaluation of real-time inference. Overall, STAR provides a practical, principled approach to uncover fine-grained categories in text and supports deployment in latency-sensitive applications through centroid inference and dynamic neighborhood weighting.

Abstract

Fine-grained category discovery using only coarse-grained supervision is a cost-effective yet challenging task. Previous training methods focus on aligning query samples with positive samples and distancing them from negatives. They often neglect intra-category and inter-category semantic similarities of fine-grained categories when navigating sample distributions in the embedding space. Furthermore, some evaluation techniques that rely on pre-collected test samples are inadequate for real-time applications. To address these shortcomings, we introduce a method that successfully detects fine-grained clusters of semantically similar texts guided by a novel objective function. The method uses semantic similarities in a logarithmic space to guide sample distributions in the Euclidean space and to form distinct clusters that represent fine-grained categories. We also propose a centroid inference mechanism to support real-time applications. The efficacy of the method is both theoretically justified and empirically confirmed on three benchmark tasks. The proposed objective function is integrated into multiple contrastive-learning-based neural models. Its results surpass existing state-of-the-art approaches in terms of Accuracy, Adjusted Rand Index, and Normalized Mutual Information of the detected fine-grained categories. Code and data are publicly available at https://github.com/changtianluckyforever/F-grained-STAR.
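The abstract names a centroid inference mechanism but gives no details here. The sketch below is one plausible reading, not the authors' implementation: it assumes that training embeddings have already been grouped into clusters, that each centroid is the mean of its cluster's embeddings, and that a new query is assigned to the nearest centroid in Euclidean distance. The function names and toy data are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_centroids(embeddings, cluster_ids):
    """Group embeddings by cluster id and return {cluster_id: centroid}."""
    groups = {}
    for emb, cid in zip(embeddings, cluster_ids):
        groups.setdefault(cid, []).append(emb)
    return {cid: mean_vector(members) for cid, members in groups.items()}


def assign_to_centroid(query, centroids):
    """Return the id of the centroid nearest to the query (Euclidean distance)."""
    return min(centroids, key=lambda cid: math.dist(query, centroids[cid]))


# Toy 2-D embeddings standing in for encoder outputs (assumed, not from the paper).
train_embeddings = [[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.0, 4.2]]
train_clusters = [0, 0, 1, 1]

centroids = build_centroids(train_embeddings, train_clusters)
predicted = assign_to_centroid([3.5, 3.9], centroids)
```

Under these assumptions, a single query is compared against one vector per cluster and no pre-collected batch of test samples is needed, which is the property that matters for real-time use.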

Paper Structure (26 sections, 7 equations, 4 figures, 5 tables)

This paper contains 26 sections, 7 equations, 4 figures, 5 tables.

Introduction
Related Work
Fine-grained Category Discovery
Neighborhood Contrastive Learning
Problem Formulation
Method
Multi-task Pre-training
Neighbors Retrieval and Weighting
Training
Objective Function
Loss Analysis
Inference
Experiments
Experimental Settings
Datasets
...and 11 more sections

Figures (4)

Figure 1: A fine-grained intent detection example. Left: This panel illustrates the label hierarchy, transitioning from coarse-grained to fine-grained granularity. Right: This example demonstrates intent detection in a conversation about car choices, showing how coarse-grained analysis alone can lead to incorrect recommendations by a life assistant due to a lack of fine-grained analysis.
Figure 2: Visualization of comprehensive semantic similarities (CSS). The wavy line indicates the bidirectional KL divergence between two samples.
Figure 3: STAR-DOWN integrates the baseline DOWN with the STAR method (shown in the red dashed box). In the visual representation, colors differentiate samples, squares represent features extracted by the Encoder, and circles denote features extracted by the Momentum Encoder. Unidirectional arrows indicate proximity, while bidirectional arrows signify distance between samples.
Figure 4: The t-SNE visualization of sample embeddings from STAR-DOWN method on the HWU64 dataset, with different colors representing different coarse-grained categories. The distinct clusters represent the discovered fine-grained categories.
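The Figure 2 caption refers to a bidirectional KL divergence between two samples. As a rough illustration only, the sketch below computes the sum of the two directed KL divergences for a pair of discrete probability distributions. How the paper turns a sample into a distribution is not stated here, so the inputs are simply assumed to be probability vectors; the function names and the `eps` smoothing constant are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Directed KL(p || q) for two discrete distributions of equal length.

    eps guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def bidirectional_kl(p, q):
    """Sum of both directed divergences, which is symmetric in p and q."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Two toy distributions (assumed values, for illustration only).
p = [0.5, 0.5]
q = [0.9, 0.1]
score = bidirectional_kl(p, q)
```

The result is zero when the two distributions are identical and grows as they diverge, so a smaller value indicates more similar samples.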
