Consistent algorithms for multi-label classification with macro-at-$k$ metrics

Erik Schultheis, Wojciech Kotłowski, Marek Wydmuch, Rohit Babbar, Strom Borman, Krzysztof Dembczyński

TL;DR

The paper addresses optimizing complex macro-at-$k$ metrics in multi-label classification under budgeted predictions. It shows that for linear macro-utilities the optimal rule reduces to top-$k$ labels after an affine transform of label marginals, and it develops a Frank-Wolfe–based, statistically consistent learning algorithm that extends to nonlinear metrics via gradient-based linearization. Theoretical contributions establish the existence and form of the optimal confusion tensor and provide convergence guarantees for the proposed algorithm, including a regret bound that accounts for marginal-estimation error. Empirically, the approach yields competitive macro-measures on extreme multi-label benchmarks and scales to thousands of labels, with practical considerations like sparse marginals and tail-label sensitivity highlighted. Overall, the work provides a principled, scalable framework for consistent optimization of complex macro-at-$k$ metrics in budgeted multi-label problems.

Abstract

We consider the optimization of complex performance metrics in multi-label classification under the population utility framework. We mainly focus on metrics linearly decomposable into a sum of binary classification utilities applied separately to each label with an additional requirement of exactly $k$ labels predicted for each instance. These "macro-at-$k$" metrics possess desired properties for extreme classification problems with long tail labels. Unfortunately, the at-$k$ constraint couples the otherwise independent binary classification tasks, leading to a much more challenging optimization problem than standard macro-averages. We provide a statistical framework to study this problem, prove the existence and the form of the optimal classifier, and propose a statistically consistent and practical learning algorithm based on the Frank-Wolfe method. Interestingly, our main results concern even more general metrics being non-linear functions of label-wise confusion matrices. Empirical results provide evidence for the competitive performance of the proposed approach.
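To make the metric family concrete, the following is a minimal sketch of how a macro-averaged metric under an at-$k$ budget could be computed from label-wise confusion matrices. The function names, the toy data, and the convention of scoring a label as 0 when its denominator is empty are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Callable, List, Sequence

Matrix = Sequence[Sequence[int]]  # rows = instances, columns = labels, entries 0/1


def label_confusions(y_true: Matrix, y_pred: Matrix) -> List[List[int]]:
    """Per-label counts [tn, fp, fn, tp] (unnormalized)."""
    num_labels = len(y_true[0])
    counts = [[0, 0, 0, 0] for _ in range(num_labels)]
    for truth, pred in zip(y_true, y_pred):
        for j in range(num_labels):
            # index 2*truth + pred: 0 = tn, 1 = fp, 2 = fn, 3 = tp
            counts[j][2 * truth[j] + pred[j]] += 1
    return counts


def macro_average(counts: List[List[int]],
                  binary_utility: Callable[[int, int, int, int], float]) -> float:
    """Apply a binary utility to each label separately, then average over labels."""
    return sum(binary_utility(*c) for c in counts) / len(counts)


def precision(tn: int, fp: int, fn: int, tp: int) -> float:
    # Assumed convention: a label that is never predicted scores 0.
    return tp / (tp + fp) if tp + fp > 0 else 0.0


def recall(tn: int, fp: int, fn: int, tp: int) -> float:
    # Assumed convention: a label with no positives scores 0.
    return tp / (tp + fn) if tp + fn > 0 else 0.0


def satisfies_budget(y_pred: Matrix, k: int) -> bool:
    """The at-k constraint: exactly k labels predicted for every instance."""
    return all(sum(row) == k for row in y_pred)


# Toy data: 3 instances, 4 labels, k = 2.
Y_TRUE = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 0]]
Y_PRED = [[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1]]

COUNTS = label_confusions(Y_TRUE, Y_PRED)
MACRO_PRECISION_AT_2 = macro_average(COUNTS, precision)
MACRO_RECALL_AT_2 = macro_average(COUNTS, recall)
```

Every label contributes equally to the average regardless of how often it occurs, which is why such metrics are sensitive to tail labels. The budget check is the part that couples the labels: changing the prediction for one label on an instance forces a change for another label on the same instance.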

Paper Structure

This paper contains 26 sections, 21 theorems, 81 equations, 1 figure, 8 tables, 1 algorithm.

Key Result

Theorem 4.1

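The optimal classifier $h^\star \coloneqq \operatorname{argmax}_{h \in \mathcal{H}} \Psi(h)$ for $\Psi(h) = \boldsymbol{G} \cdot \boldsymbol{C}(h)$ is a top-$k$ rule applied to an affine transform of the label marginals $\boldsymbol{\eta}(\boldsymbol{x})$: $h^\star(\boldsymbol{x}) = \operatorname{top}_{k}(\boldsymbol{a} \odot \boldsymbol{\eta}(\boldsymbol{x}) + \boldsymbol{b})$, where $\odot$ denotes the coordinate-wise product of vectors, the slope and intercept vectors $\boldsymbol{a}$ and $\boldsymbol{b}$ are determined by the gain tensor $\boldsymbol{G}$, and $\operatorname{top}_{k}(\boldsymbol{v})$ returns a $k$-hot vector marking the $k$ largest entries of $\boldsymbol{v}$.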
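The rule in this theorem can be sketched in a few lines. The explicit slope and intercept formulas in the code are not copied from the extracted theorem text. They are derived here under an assumed indexing convention, with `G[j][y][p]` as the gain on label `j` for true value `y` and predicted value `p`. Under that convention, predicting label `j` has expected gain `eta*g11 + (1 - eta)*g01` and not predicting it has `eta*g10 + (1 - eta)*g00`; the difference is affine in `eta`, so the best budgeted prediction keeps the `k` labels with the largest difference. All function names and the toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

Gain = Sequence[Sequence[float]]  # 2x2 matrix, indexed [true][pred] (assumed convention)


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return a k-hot vector marking the k largest scores (ties: lowest index first)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    chosen = set(order[:k])
    return [1 if j in chosen else 0 for j in range(len(scores))]


def affine_coefficients(gains: Sequence[Gain]) -> Tuple[List[float], List[float]]:
    """Slope a_j and intercept b_j of the gain difference 'predict j' minus 'skip j'."""
    slope = [g[1][1] + g[0][0] - g[0][1] - g[1][0] for g in gains]
    intercept = [g[0][1] - g[0][0] for g in gains]
    return slope, intercept


def budgeted_prediction(eta: Sequence[float], gains: Sequence[Gain], k: int) -> List[int]:
    """top_k(a * eta + b) for one instance with label marginals eta."""
    slope, intercept = affine_coefficients(gains)
    scores = [a * e + b for a, e, b in zip(slope, eta, intercept)]
    return top_k(scores, k)


# Toy instance: marginals for 3 labels.
ETA = [0.6, 0.3, 0.2]

# Gains that reward true positives equally: reduces to plain top-k on the marginals.
UNIFORM_GAINS = [[[0.0, 0.0], [0.0, 1.0]] for _ in ETA]

# Gains that reward true positives by 1/prior (a recall-like weighting);
# the priors are made-up numbers for illustration.
PRIORS = [0.5, 0.1, 0.25]
RECALL_LIKE_GAINS = [[[0.0, 0.0], [0.0, 1.0 / p]] for p in PRIORS]

PLAIN_TOP1 = budgeted_prediction(ETA, UNIFORM_GAINS, 1)
WEIGHTED_TOP1 = budgeted_prediction(ETA, RECALL_LIKE_GAINS, 1)
WEIGHTED_TOP2 = budgeted_prediction(ETA, RECALL_LIKE_GAINS, 2)
```

With uniform gains the rule picks the label with the highest marginal, whereas the recall-like weighting moves the single prediction to the rare second label, since its marginal of 0.3 is scaled by a larger slope.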

Figures (1)

  • Figure 1: Comparison of the baseline algorithms with the PU inference with mixed objectives for $k \in \{3, 5, 10\}$. The green line shows the results for different interpolations between two measures.

Theorems & Definitions (45)

  • Definition 3.0: Binary Confusion Matrix Measure
  • Definition 3.0: Confusion Tensor Measure
  • Theorem 4.1
  • Proof (sketch; full proof in Appendix \ref{app:linear_metric})
  • Theorem 4.4
  • Proof (sketch; full proof in Appendix \ref{app:the_optimal_classifier})
  • Theorem 5.1: Consistency of Frank-Wolfe
  • Lemma 5.1: VC dimension for linear top-k classifiers
  • Theorem A.1
  • Proof
  • ...and 35 more