From Lazy to Prolific: Tackling Missing Labels in Open Vocabulary Extreme Classification by Positive-Unlabeled Sequence Learning
Ranran Haoran Zhang, Bensu Uçar, Soumik Dey, Hansi Wu, Binbin Li, Rui Zhang
TL;DR
Open-vocabulary XMC suffers from missing labels caused by self-selection bias, which leads generation models to under-generate. The authors propose Positive-Unlabeled Sequence Learning (PUSL), which treats observed labels as positives and unobserved labels as unlabeled, enabling infinite keyphrase generation and post-training for diversity. They introduce F1@$\mathcal{O}$ and BudgetAccuracy@$k$ (B@$k$) as faithful evaluation metrics under incomplete ground truth. Experiments on Ads-XMC and EURLex-4.3k show that PUSL increases label diversity and alignment with user queries, and that its results remain robust as label counts scale, highlighting its practical impact for real-world OXMC.
Abstract
Open-vocabulary Extreme Multi-label Classification (OXMC) extends traditional XMC by allowing prediction beyond an extremely large, predefined label set (typically $10^3$ to $10^{12}$ labels), addressing the dynamic nature of real-world labeling tasks. However, self-selection bias in data annotation leads to significant missing labels in both training and test data, particularly for less popular inputs. This creates two critical challenges: generation models learn to be "lazy" by under-generating labels, and evaluation becomes unreliable due to insufficient annotation in the test set. In this work, we introduce Positive-Unlabeled Sequence Learning (PUSL), which reframes OXMC as an infinite keyphrase generation task, addressing the generation model's laziness. Additionally, we propose adopting a suite of evaluation metrics, F1@$\mathcal{O}$ and the newly proposed B@$k$, to reliably assess OXMC models with incomplete ground truth. On a highly imbalanced e-commerce dataset with substantial missing labels, PUSL generates 30% more unique labels, and 72% of its predictions align with actual user queries. On the less skewed EURLex-4.3k dataset, PUSL achieves superior F1 scores, especially as label counts increase from 15 to 30. Our approach effectively tackles both the modeling and evaluation challenges in OXMC with missing labels.
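
To make the F1@$\mathcal{O}$ metric more concrete, the following is a minimal sketch. It assumes the convention used in keyphrase generation work, where the ranked prediction list is cut off at $\mathcal{O}$, the number of ground-truth labels for that input, before precision and recall are computed. The abstract does not spell out the definition, so the cutoff rule, the exact-match comparison, and the function name `f1_at_o` are all assumptions made for illustration.

```python
from typing import Sequence


def f1_at_o(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """F1 with predictions truncated to the number of gold labels.

    Assumption: the cutoff O equals the number of distinct gold labels,
    and a prediction counts as correct only on an exact string match.
    """
    gold_set = set(gold)
    if not gold_set:
        return 0.0

    # Drop duplicate predictions while preserving rank order.
    seen = set()
    ranked = []
    for label in predicted:
        if label not in seen:
            seen.add(label)
            ranked.append(label)

    # Keep only the top-O predictions, where O = number of gold labels.
    top = ranked[: len(gold_set)]
    hits = sum(1 for label in top if label in gold_set)
    if hits == 0:
        return 0.0

    precision = hits / len(top)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

Because the cutoff follows the number of annotated labels, a model is not penalized for the sheer volume of labels it generates beyond that cutoff. The metric still depends on which labels happen to be annotated, which is why the abstract pairs it with a second metric.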
