Your Next State-of-the-Art Could Come from Another Domain: A Cross-Domain Analysis of Hierarchical Text Classification

Nan Li, Bo Kang, Tijl De Bie

TL;DR

This paper studies hierarchical text classification (HTC), i.e., text classification with hierarchical labels, across multiple domains, addressing the gap that most prior work is domain-specific. It introduces a unified, domain-agnostic framework built from nine submodules and conducts a large cross-domain evaluation of eight representative HTC methods on eight datasets from five domains, re-implementing the methods and standardizing data processing. Key findings show that dataset characteristics and architectural choices, rather than domain origin, largely drive performance, and that transferring submodules across domains can yield new state-of-the-art results (e.g., cross-domain gains on NYT-166, SciHTC-83, USPTO2M-632). The study also finds that domain-specific LLMs help most for simpler models and low-resource settings, and that long-document handling is critical for medical text. Combining innovations from different domains can produce robust HTC systems, with practical implications for cross-domain knowledge transfer and benchmark design.

Abstract

Text classification with hierarchical labels is a prevalent and challenging task in natural language processing. Examples include assigning ICD codes to patient records, tagging patents into IPC classes, assigning EUROVOC descriptors to European legal texts, and more. Despite its widespread applications, a comprehensive understanding of state-of-the-art methods across different domains has been lacking. In this paper, we provide the first comprehensive cross-domain overview with empirical analysis of state-of-the-art methods. We propose a unified framework that positions each method within a common structure to facilitate research. Our empirical analysis yields key insights and guidelines, confirming the necessity of learning across different research areas to design effective methods. Notably, under our unified evaluation pipeline, we achieved new state-of-the-art results by applying techniques beyond their original domains.

Paper Structure

This paper contains 48 sections, 5 figures, 17 tables.

Figures (5)

  • Figure 1: Our framework and methods with submodule combinations.
  • Figure 2: General trends in model performance across different datasets. (a) Label space complexity: Performance decreases with larger label sizes. (b) Model architecture: LLM-based models with advanced learning strategies outperform simpler architectures. (c) Text-label information fusion: Sophisticated mechanisms for combining text and label information yield better results than basic approaches.
  • Figure 3: Correlations between dataset characteristics and model performance (precision@1). Only correlations with absolute values greater than 0.3 are shown, with features sorted top-down by absolute correlation value. The y-axis shows dataset features: "#labels" refers to the total number of distinct classes; "Max/Min/Avg #labels" refers to the maximum, minimum, and mean number of labels per document; and "Max/Min/Avg #samples" refers to the maximum, minimum, and mean number of training examples per label. Bars of different colors show three performance statistics: the mean, maximum, and minimum precision@1 scores across all models.
  • Figure 4: Performance changes from PLM-ICD to PLM-ICD+Label2Vec plotted against dataset characteristics. It shows that augmenting PLM-ICD with label semantic information is beneficial for datasets containing diverse and rare label combinations. The x-axis shows the average pattern IDF (measuring label combination diversity, as defined in the paper), and the y-axis shows tail pattern coverage (the proportion of samples with rare label combinations, as defined in the paper). Each point represents a dataset, with larger improvements (shown by point size) occurring in datasets with both high IDF and high tail coverage. MIMIC3-3681 results are shown for both BERT (B) and RoBERTa-pm (R) encoders, where RoBERTa-pm is the original text encoder used by PLM-ICD.
  • Figure 5: P/R@1 changes from BERT to domain-specific LLMs on MIMIC3-3681 and USPTO2M-632. The paired bars show performance improvements when switching from BERT to domain-specific LLMs. Larger gains are seen on MIMIC3-3681 compared to USPTO2M-632, especially for simpler architectures like FlatBERT. The horizontal dashed line indicates a new state-of-the-art achieved by PLM-ICD using RoBERTa-PM (a medical LLM) on USPTO2M-632 (patent), surprisingly outperforming SciBERT (a scientific LLM).
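
Several of the captions above report precision@1. As a rough illustration only, the following Python sketch computes precision@k under the standard multi-label definition: for each document, the fraction of its k highest-scored labels that appear in the gold label set, averaged over documents. The function name `precision_at_k`, the toy scores, and the toy gold sets are hypothetical and are not taken from the paper; the paper's own evaluation pipeline may differ in details such as tie-breaking.

```python
from typing import Sequence, Set


def precision_at_k(
    scores: Sequence[Sequence[float]],
    gold: Sequence[Set[int]],
    k: int = 1,
) -> float:
    """Mean over documents of (gold labels among the k top-scored labels) / k.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if len(scores) != len(gold):
        raise ValueError("scores and gold must have the same number of documents")
    if not scores:
        raise ValueError("at least one document is required")
    if k < 1:
        raise ValueError("k must be at least 1")

    total = 0.0
    for doc_scores, doc_gold in zip(scores, gold):
        # Label indices sorted by descending score; keep the first k.
        ranked = sorted(range(len(doc_scores)), key=lambda j: doc_scores[j], reverse=True)
        top_k = ranked[:k]
        hits = sum(1 for j in top_k if j in doc_gold)
        total += hits / k
    return total / len(scores)


# Toy data (made up): 3 documents, 3 labels, one score per label.
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.7, 0.6],
    [0.4, 0.3, 0.8],
]
toy_gold = [{0, 2}, {2}, {2}]

p_at_1 = precision_at_k(toy_scores, toy_gold, k=1)
p_at_2 = precision_at_k(toy_scores, toy_gold, k=2)
print(f"P@1 = {p_at_1:.3f}, P@2 = {p_at_2:.3f}")
```

With k = 1 the metric reduces to the share of documents whose single top-ranked label is correct, which is why it is sensitive to the label-space size discussed in Figure 2(a).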