Revisiting Hierarchical Text Classification: Inference and Metrics

Roman Plaud, Matthieu Labeau, Antoine Saillenfest, Thomas Bonald

TL;DR

This work proposes to evaluate models based on specifically designed hierarchical metrics and demonstrates the intricacy of metric choice and prediction inference method.

Abstract

Hierarchical text classification (HTC) is the task of assigning labels to a text within a structured space organized as a hierarchy. Recent works treat HTC as a conventional multilabel classification problem and therefore evaluate it as such. We instead propose to evaluate models with specifically designed hierarchical metrics, and we demonstrate the intricacy of the choice of metric and of the prediction inference method. We introduce a new challenging dataset and fairly evaluate recent sophisticated models, comparing them with a range of simple but strong baselines, including a new theoretically motivated loss. Finally, we show that those baselines are very often competitive with the latest models. This highlights the importance of carefully considering the evaluation methodology when proposing new methods for HTC. Code implementation and dataset are available at https://github.com/RomanPlaud/revisitingHTC.

Paper Structure (41 sections, 4 theorems, 50 equations, 8 figures, 8 tables)

Key Result

Proposition 1

In micro and samples settings, if every prediction $\hat{Y}$ is coherent, then hF1 and F1 are strictly equal.
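
The intuition behind Proposition 1 can be checked on a toy example. The sketch below is not the authors' implementation: it assumes the common definition of hF1, in which gold and predicted label sets are augmented with all their ancestors before precision and recall are computed, and it assumes gold label sets are themselves coherent (closed under ancestors). The hierarchy `PARENT`, the label names, and all function names are hypothetical. Only the micro setting is illustrated.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical toy hierarchy: child -> parent (None marks a top-level label).
PARENT: Dict[str, Optional[str]] = {
    "A": None, "B": None, "A1": "A", "A2": "A", "B1": "B",
}


def with_ancestors(labels: Iterable[str]) -> Set[str]:
    """Return the label set augmented with every ancestor of each label."""
    closed: Set[str] = set()
    for label in labels:
        node: Optional[str] = label
        while node is not None:
            closed.add(node)
            node = PARENT[node]
    return closed


def is_coherent(labels: Iterable[str]) -> bool:
    """A label set is coherent if it already contains all its ancestors."""
    labels = set(labels)
    return with_ancestors(labels) == labels


def micro_f1(golds: List[Set[str]], preds: List[Set[str]]) -> float:
    """Micro-averaged F1: counts are pooled over all samples."""
    tp = sum(len(g & p) for g, p in zip(golds, preds))
    if tp == 0:
        return 0.0
    precision = tp / sum(len(p) for p in preds)
    recall = tp / sum(len(g) for g in golds)
    return 2 * precision * recall / (precision + recall)


def micro_hf1(golds: List[Set[str]], preds: List[Set[str]]) -> float:
    """Micro hF1 (assumed definition): F1 on ancestor-augmented sets."""
    return micro_f1(
        [with_ancestors(g) for g in golds],
        [with_ancestors(p) for p in preds],
    )


# Gold label sets are full paths, hence coherent.
golds = [{"A", "A1"}, {"B", "B1"}]

# Coherent predictions: augmentation changes nothing, so hF1 equals F1.
coherent_preds = [{"A", "A2"}, {"B"}]

# Incoherent predictions (leaves without their parents): the scores differ.
incoherent_preds = [{"A1"}, {"B1"}]

f1_coherent = micro_f1(golds, coherent_preds)
hf1_coherent = micro_hf1(golds, coherent_preds)
f1_incoherent = micro_f1(golds, incoherent_preds)
hf1_incoherent = micro_hf1(golds, incoherent_preds)
```

When every set is closed under ancestors, the augmentation step is the identity, so both metrics are computed from the same counts. With the incoherent predictions, augmentation adds the missing parents and the hierarchical score is higher than the flat one.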

Figures (8)

  • Figure 1: Extract of the taxonomy of our new dataset Hierarchical WikiVitals. Each colored path is the set of labels of the same color.
  • Figure 2: Example of a conditional distribution estimation over a simple hierarchy and corresponding predicted nodes (in blue) for different thresholds ($0.3$ on the left, $0.5$ on the right).
  • Figure 3: Averaged Macro F1-Scores on the test set per depth for different models and for the HWV dataset. The error bars represent a $95\%$ confidence interval.
  • Figure 4: Averaged Macro F1-Scores on the test set by quantiles of label counts distribution in the training set for different models and for the HWV dataset. The shaded regions represent a $95\%$ confidence interval.
  • Figure 5: Number of nodes per depth for the HWV dataset. The hatched histogram corresponds to leaf nodes.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4