From Lazy to Prolific: Tackling Missing Labels in Open Vocabulary Extreme Classification by Positive-Unlabeled Sequence Learning

Ranran Haoran Zhang, Bensu Uçar, Soumik Dey, Hansi Wu, Binbin Li, Rui Zhang

TL;DR

Open-vocabulary XMC suffers from missing labels caused by self-selection bias in annotation, which leads generation models to under-generate. The authors propose Positive-Unlabeled Sequence Learning (PUSL), which treats observed labels as positives and unobserved labels as unlabeled, enabling infinite keyphrase generation and post-training for diversity. They introduce F1@$\mathcal{O}$ and BudgetAccuracy@$k$ (B@$k$) as faithful evaluation metrics under incomplete ground truth. Experiments on Ads-XMC and EURLex-4.3k show that PUSL increases label diversity and alignment with user queries and remains robust as label counts scale, highlighting its practical impact for real-world OXMC.

Abstract

Open-vocabulary Extreme Multi-label Classification (OXMC) extends traditional XMC by allowing prediction beyond an extremely large, predefined label set (typically $10^3$ to $10^{12}$ labels), addressing the dynamic nature of real-world labeling tasks. However, self-selection bias in data annotation leads to significant missing labels in both training and test data, particularly for less popular inputs. This creates two critical challenges: generation models learn to be "lazy" by under-generating labels, and evaluation becomes unreliable due to insufficient annotation in the test set. In this work, we introduce Positive-Unlabeled Sequence Learning (PUSL), which reframes OXMC as an infinite keyphrase generation task, addressing the generation model's laziness. Additionally, we propose to adopt a suite of evaluation metrics, F1@$\mathcal{O}$ and newly proposed B@$k$, to reliably assess OXMC models with incomplete ground truths. In a highly imbalanced e-commerce dataset with substantial missing labels, PUSL generates 30% more unique labels, and 72% of its predictions align with actual user queries. On the less skewed EURLex-4.3k dataset, PUSL demonstrates superior F1 scores, especially as label counts increase from 15 to 30. Our approach effectively tackles both the modeling and evaluation challenges in OXMC with missing labels.

Paper Structure

This paper contains 40 sections, 5 equations, 5 figures, 9 tables, 2 algorithms.

Figures (5)

  • Figure 1: Real annotators often provide limited labels (e.g., Switch, Switch Zelda Edition), while many potential labels remain uncaptured (e.g., OLED Switch, Special Switch). This gap between observed and expected labels misleads generation models into being lazy, terminating label generation prematurely. Our proposed PUSL resolves the model laziness problem and can learn from incomplete ground truth.
  • Figure 2: Analysis of Missing Labels in OXMC.
  • Figure 3: Comparison of biases in keyphrase generation models: One2Seq's early termination, One2One's over-generation, and PUSL.
  • Figure 4: Comparison of evaluation metrics for lazy and prolific models under incomplete ground truth. Ground-truth labels beyond K are truncated in the denominator. Traditional P@K favors lazy models even when all predictions are relevant. The proposed P@O equalizes performance between lazy and prolific models, while B@K penalizes under-generation. These metrics provide a more faithful evaluation when ground truth is missing.
  • Figure 5: Case study of One2One passing the Uni (user query universe) evaluation but failing the Human evaluation. The item title is "Joystick Rocker Cap Buttons Cover Thumb Stick Grip Cap for PS5 DualSense Edge," and One2One predicts the label "thumb grip." While the human annotator considers "thumb grip" a made-up term, buyers may actually use it to search for joystick caps. This highlights the complexity of evaluating search term predictions, as informal or colloquial terms might be practically useful despite not being formally correct.
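
The contrast described in the Figure 4 caption can be illustrated with a small sketch. The exact metric definitions are not given in this section, so the code below rests on assumptions: P@K is taken as precision over the top-K predictions divided by the number of predictions actually evaluated, and P@O sets the cutoff to the number of observed ground-truth labels, following the common reading of "@O" in keyphrase generation. The function names, the toy labels (borrowed from Figure 1), and the ranking of the prolific model's predictions are all hypothetical. B@K is not sketched because its definition does not appear here.

```python
def precision_at(predictions, observed, cutoff):
    """Precision over the top-`cutoff` predictions.

    Assumption: the denominator is the number of predictions actually
    evaluated, so a model that stops early is not padded with misses.
    """
    top = predictions[:cutoff]
    if not top:
        return 0.0
    hits = sum(1 for label in top if label in observed)
    return hits / len(top)


def precision_at_oracle(predictions, observed):
    """P@O: the cutoff equals the number of observed ground-truth labels."""
    return precision_at(predictions, observed, len(observed))


# Observed (incomplete) ground truth, as in the Figure 1 example.
observed = {"switch", "switch zelda edition"}

# A lazy model stops after the observed labels; a prolific model also emits
# labels that are relevant but were never annotated.
lazy = ["switch", "switch zelda edition"]
prolific = ["switch", "switch zelda edition", "oled switch", "special switch"]

p_at_4_lazy = precision_at(lazy, observed, 4)
p_at_4_prolific = precision_at(prolific, observed, 4)
p_at_o_lazy = precision_at_oracle(lazy, observed)
p_at_o_prolific = precision_at_oracle(prolific, observed)

print(p_at_4_lazy, p_at_4_prolific)   # fixed cutoff: the lazy model scores higher
print(p_at_o_lazy, p_at_o_prolific)   # oracle cutoff: both models score the same
```

Under these assumptions the fixed cutoff penalizes the prolific model for unannotated labels, whereas the oracle cutoff only judges as many predictions as there are observed labels, which is why the two models tie.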