One Size Does Not Fit All: Exploring Variable Thresholds for Distance-Based Multi-Label Text Classification

Jens Van Nooten, Andriy Kosar, Guy De Pauw, Walter Daelemans

TL;DR

This work investigates distance-based multi-label text classification with a focus on per-label thresholding. By analyzing similarity distributions across multiple encoder models and domains, the authors demonstrate substantial variability that undermines universal thresholds. They propose a simple, data-efficient method to learn label-specific thresholds on a validation set, yielding significant macro-$F_1$ improvements and often competing with zero-shot LLMs. The findings have practical implications for efficient, scalable MLTC and information retrieval tasks, with potential extensions to other modalities and domains.

Abstract

Distance-based unsupervised text classification is a method within text classification that leverages the semantic similarity between a label and a text to determine label relevance. This method provides numerous benefits, including fast inference and adaptability to expanding label sets, as opposed to zero-shot, few-shot, and fine-tuned neural networks that require re-training in such cases. In multi-label distance-based classification and information retrieval algorithms, thresholds are required to determine whether a text instance is "similar" to a label or query. Similarity between a text and label is determined in a dense embedding space, usually generated by state-of-the-art sentence encoders. Multi-label classification complicates matters, as a text instance can have multiple true labels, unlike in multi-class or binary classification, where each instance is assigned only one label. We expand upon previous literature on this underexplored topic by thoroughly examining and evaluating the ability of sentence encoders to perform distance-based classification. First, we perform an exploratory study to verify whether the semantic relationships between texts and labels vary across models, datasets, and label sets by conducting experiments on a diverse collection of realistic multi-label text classification (MLTC) datasets. We find that similarity distributions show statistically significant differences across models, datasets and even label sets. We propose a novel method for optimizing label-specific thresholds using a validation set. Our label-specific thresholding method achieves an average improvement of 46% over normalized 0.5 thresholding and outperforms uniform thresholding approaches from previous work by an average of 14%. Additionally, the method demonstrates strong performance even with limited labeled examples.
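The abstract describes the approach only at a high level: texts and labels are embedded, cosine similarity measures label relevance, and one threshold per label is tuned on a validation set. The sketch below illustrates that idea under stated assumptions. The grid search, the per-label F1 objective, the function names, and the toy similarity values are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def cosine_similarity_matrix(text_emb, label_emb):
    """Return an (n_texts, n_labels) matrix of cosine similarities."""
    text_unit = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    label_unit = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return text_unit @ label_unit.T

def f1_binary(y_true, y_pred):
    """F1 for one label; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def fit_label_thresholds(sims, y_true, grid=None):
    """Pick, for each label, the grid threshold with the best validation F1.

    Assumption: an exhaustive grid over [-1, 1] with F1 as the objective.
    The source only states that thresholds are optimized on a validation set.
    """
    if grid is None:
        grid = np.linspace(-1.0, 1.0, 201)
    thresholds = np.empty(sims.shape[1])
    for j in range(sims.shape[1]):
        scores = [f1_binary(y_true[:, j], (sims[:, j] >= t).astype(int))
                  for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]  # first best threshold
    return thresholds

def predict(sims, thresholds):
    """Assign every label whose similarity reaches its own threshold."""
    return (sims >= thresholds).astype(int)

# Toy validation data (made up): 4 texts, 2 labels.
val_sims = np.array([[0.9, 0.2],
                     [0.8, 0.4],
                     [0.3, 0.5],
                     [0.1, 0.1]])
val_labels = np.array([[1, 0],
                       [1, 1],
                       [0, 1],
                       [0, 0]])

thresholds = fit_label_thresholds(val_sims, val_labels)
val_pred = predict(val_sims, thresholds)
```

In this toy case the two labels separate at different similarity levels, so a single shared cut-off of 0.5 would miss one of the true assignments for the second label, while the per-label thresholds reproduce the validation labels. At inference time, adding a new label would only require embedding it and fitting one more threshold.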

Paper Structure

This paper contains 31 sections, 3 equations, 8 figures, 16 tables.

Figures (8)

  • Figure 1: Overview of the thresholding approach and the inference stage. Texts and labels are embedded, after which cosine similarity is used to determine label relevance. Thresholds are optimized based on performance on a validation set.
  • Figure 2: Bar charts showing the variation in (non-normalized) cosine similarity distributions between individual models and domains.
  • Figure 3: Results from learning curve experiments with the best-performing model on each dataset (GIST-Large for SemEval, GTE-Large for BioTech, and Stella for all other datasets). The dashed lines show the results obtained with uniform fine-tuned thresholds. Five random samples are taken for each sample size.
  • Figure 4: Distributions of cosine similarity scores per label of the LitCovid dataset, obtained using Stella.
  • Figure 5: Distributions of cosine similarity scores per label of the BioTech dataset, obtained using GIST-Large.
  • ...and 3 more figures