Harnessing Superclasses for Learning from Hierarchical Databases

Nicolas Urbani, Sylvain Rousseau, Yves Grandvalet, Leonardo Tanzi

TL;DR

This work introduces a loss for supervised hierarchical classification that uses the known class hierarchy to assign each example not only to a class but also to all of its encompassing superclasses, which makes the classification objectives for superclasses and fine-grained classes consistent with one another.

Abstract

In many large-scale classification problems, classes are organized in a known hierarchy, typically represented as a tree expressing the inclusion of classes in superclasses. We introduce a loss for this type of supervised hierarchical classification. It utilizes the knowledge of the hierarchy to assign each example not only to a class but also to all encompassing superclasses. Applicable to any feedforward architecture with a softmax output layer, this loss is a proper scoring rule, in that its expectation is minimized by the true posterior class probabilities. This property allows us to simultaneously pursue consistent classification objectives between superclasses and fine-grained classes, and eliminates the need for a performance trade-off between different granularities. We conduct an experimental study on three reference benchmarks, in which we vary the size of the training sets to cover a diverse set of learning scenarios. Our approach does not entail any significant additional computational cost compared with the cross-entropy loss. It improves accuracy and reduces the number of coarse errors, that is, predictions whose labels are distant from the ground-truth labels in the tree.
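To make the idea in the abstract concrete, here is a minimal sketch of one plausible reading of it, not the authors' implementation. It assumes that the probability of a superclass is the sum of the softmax probabilities of the leaf classes it contains, and that the loss is a weighted sum of negative log-probabilities over the true class and its superclasses. The toy hierarchy, the function names and the node weights are all hypothetical; in particular, the weights are arbitrary placeholders rather than the exponential weighting scheme mentioned in Proposition 1.

```python
import math

# Hypothetical toy hierarchy: child -> parent (the root has parent None).
PARENT = {
    "root": None,
    "animal": "root", "vehicle": "root",
    "cat": "animal", "dog": "animal",
    "car": "vehicle", "bus": "vehicle",
}
LEAVES = ["cat", "dog", "car", "bus"]  # order of the softmax outputs


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def path_to_root(node):
    """Return the node followed by its ancestors, excluding the root."""
    path = []
    while PARENT[node] is not None:
        path.append(node)
        node = PARENT[node]
    return path


def node_probabilities(leaf_probs):
    """Assumed rule: P(node) is the sum of the probabilities of its leaves."""
    probs = {}
    for leaf, p in zip(LEAVES, leaf_probs):
        for node in path_to_root(leaf):
            probs[node] = probs.get(node, 0.0) + p
    return probs


def hierarchical_loss(logits, true_leaf, weights):
    """Weighted negative log-likelihood of the true class and its superclasses."""
    node_probs = node_probabilities(softmax(logits))
    return -sum(weights[n] * math.log(node_probs[n])
                for n in path_to_root(true_leaf))


# Placeholder weights (an assumption): equal weight on every non-root node.
WEIGHTS = {node: 0.5 for node in PARENT if PARENT[node] is not None}

# True class is "cat". Confusing it with "dog" stays inside "animal",
# whereas confusing it with "car" is a coarse error.
fine_error = hierarchical_loss([0.0, 3.0, 0.0, 0.0], "cat", WEIGHTS)
coarse_error = hierarchical_loss([0.0, 0.0, 3.0, 0.0], "cat", WEIGHTS)
```

In this sketch both predictions give the same probability to `cat`, so plain cross-entropy would not distinguish them, while the superclass term makes `coarse_error` larger than `fine_error`. Putting all the weight on the leaf recovers the ordinary cross-entropy loss.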

Paper Structure

This paper contains 26 sections, 4 theorems, 17 equations, 4 figures.

Key Result

proposition 1

The exponential weighting scheme of Equation eq:2 produces a balanced weighted tree, where the weights along the path from the root to any leaf sum up to 1/2, following an exponential growth/decay according to the value of parameter $q$, such that $w_j = q \, w_{p(j)}$ for all children of

Figures (4)

  • Figure 1: Illustration of a tree structure recalling the notation and weighting system
  • Figure 2: Average Wasserstein distance, hierarchical distance and standard accuracy for the two backbones as a function of the number of training images for the three benchmarks. The lines represent the average results on the replicates, and the colored areas represent $\pm1$ standard deviation.
  • Figure 3: Accuracy as a function of the coarsening of the classification problem (the higher the better), for the two backbones and the three benchmarks. The lines represent the average results on the replicates, and the colored areas represent $\pm1$ standard deviation.
  • Figure 4: Wasserstein distance vs. classification accuracy with ResNet models trained on ImageNet with $6$ samples per class (best results in the bottom right-hand corner). Each point represents a training run, and hues correspond to the hyperparameter settings, for which the crosses represent the average results. The lines trace the evolution of each method's average results as a function of its hyperparameter, in red for HXE and in green for ours.

Theorems & Definitions (10)

  • proposition 1
  • remark 1
  • proposition 2
  • proposition 3
  • proposition 4
  • proof
  • proof
  • proof
  • proof
  • proof