Open-world Multi-label Text Classification with Extremely Weak Supervision

Xintong Li, Jinya Jiang, Ria Dharmani, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang

TL;DR

This work tackles open-world multi-label text classification (MLTC) under extremely weak supervision by introducing X-MLClass, a framework that builds a practical label space and a zero-shot classifier using only a brief user description. It prompts an LLM to extract dominant keyphrases from document chunks, clusters these into an initial label space, and iteratively augments that space with long-tail labels found through a textual entailment-based classifier. Across five benchmark datasets, X-MLClass delivers notably higher ground-truth label-space coverage (improvements of up to roughly 40%) and strong zero-shot accuracy, demonstrating practical viability for dynamic tagging with minimal supervision. The approach highlights the value of combining LLM-driven keyphrase generation, dimensionality reduction, clustering, and entailment-based ranking to enable scalable open-world MLTC with limited human input.

Abstract

We study open-world multi-label text classification under extremely weak supervision (XWS), where the user only provides a brief description for classification objectives without any labels or ground-truth label space. Similar single-label XWS settings have been explored recently; however, these methods cannot be easily adapted to the multi-label setting. We observe that (1) most documents have a dominant class covering the majority of content and (2) long-tail labels would appear in some documents as a dominant class. Therefore, we first utilize the user description to prompt a large language model (LLM) for dominant keyphrases of a subset of raw documents, and then construct an (initial) label space via clustering. We further apply a zero-shot multi-label classifier to locate the documents with small top predicted scores, so we can revisit their dominant keyphrases for more long-tail labels. We iterate this process to discover a comprehensive label space and construct a multi-label classifier as a novel method, X-MLClass. X-MLClass exhibits a remarkable increase in ground-truth label space coverage on various datasets; for example, it achieves a 40% improvement on the AAPD dataset over topic modeling and keyword extraction methods. Moreover, X-MLClass achieves the best end-to-end multi-label classification accuracy.
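
The iterative loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `dominant_keyphrase` is a toy stand-in for the LLM prompt, `toy_score` is a toy stand-in for the zero-shot classifier, exact-match deduplication replaces clustering, and the names `discover_label_space`, `predict`, `init_size`, `low_score`, `min_score`, and `max_iters` are all hypothetical, as are the threshold values.

```python
from collections import Counter
from typing import Callable, List


def dominant_keyphrase(doc: str) -> str:
    """Toy stand-in for the LLM prompt: return the most frequent token."""
    return Counter(doc.lower().split()).most_common(1)[0][0]


def toy_score(doc: str, label: str) -> float:
    """Toy stand-in for a zero-shot classifier score in [0, 1]."""
    tokens = doc.lower().split()
    return tokens.count(label) / len(tokens)


def discover_label_space(
    docs: List[str],
    extract: Callable[[str], str] = dominant_keyphrase,
    score: Callable[[str, str], float] = toy_score,
    init_size: int = 3,      # assumed size of the initial document subset
    low_score: float = 0.5,  # assumed cutoff for a "small top predicted score"
    max_iters: int = 5,      # assumed cap on refinement rounds
) -> List[str]:
    # Initial label space: keyphrases from a subset of documents.
    # Exact-match deduplication stands in for the clustering step.
    labels: List[str] = []
    for doc in docs[:init_size]:
        phrase = extract(doc)
        if phrase not in labels:
            labels.append(phrase)

    # Refinement: revisit documents whose best label score is low and
    # add their dominant keyphrase as a candidate long-tail label.
    for _ in range(max_iters):
        added = False
        for doc in docs:
            top = max((score(doc, label) for label in labels), default=0.0)
            if top < low_score:
                phrase = extract(doc)
                if phrase not in labels:
                    labels.append(phrase)
                    added = True
        if not added:  # stop once no new label is discovered
            break
    return labels


def predict(doc: str, labels: List[str],
            score: Callable[[str, str], float] = toy_score,
            min_score: float = 0.2) -> List[str]:
    """Multi-label prediction: keep every label scoring at least min_score."""
    return [label for label in labels if score(doc, label) >= min_score]


docs = [
    "nlp nlp nlp vision",
    "vision vision nlp",
    "nlp nlp",
    "robotics robotics robotics nlp",
    "vision vision vision",
]
labels = discover_label_space(docs)
print(labels)                    # ['nlp', 'vision', 'robotics']
print(predict(docs[0], labels))  # ['nlp', 'vision']
```

In this toy run, the initial subset yields only `nlp` and `vision`; the fourth document scores poorly against both, so its dominant keyphrase `robotics` is added in the first refinement round, mirroring how a long-tail label can surface as the dominant class of a poorly covered document.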

Paper Structure (36 sections, 2 equations, 3 figures, 6 tables)

This paper contains 36 sections, 2 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: An overview of our X-MLClass framework. The only required supervision from the user is a brief description of the classification objective. During the first LLM prompting stage for keyphrases, X-MLClass leverages this description as a part of the prompt, so it will be helpful if the description includes some demonstrations.
  • Figure 2: Coverage Improvement across Iterations.
  • Figure 3: Improvement of Label Coverage for Amazon-531 by increasing the number of iterations.