Consistent algorithms for multi-label classification with macro-at-$k$ metrics
Erik Schultheis, Wojciech Kotłowski, Marek Wydmuch, Rohit Babbar, Strom Borman, Krzysztof Dembczyński
TL;DR
The paper addresses optimizing complex macro-at-$k$ metrics in multi-label classification under budgeted predictions. It shows that for linear macro-utilities the optimal rule reduces to top-$k$ labels after an affine transform of label marginals, and it develops a Frank-Wolfe–based, statistically consistent learning algorithm that extends to nonlinear metrics via gradient-based linearization. Theoretical contributions establish the existence and form of the optimal confusion tensor and provide convergence guarantees for the proposed algorithm, including a regret bound that accounts for marginal-estimation error. Empirically, the approach yields competitive macro-measures on extreme multi-label benchmarks and scales to thousands of labels, with practical considerations like sparse marginals and tail-label sensitivity highlighted. Overall, the work provides a principled, scalable framework for consistent optimization of complex macro-at-$k$ metrics in budgeted multi-label problems.
Abstract
We consider the optimization of complex performance metrics in multi-label classification under the population utility framework. We mainly focus on metrics that are linearly decomposable into a sum of binary classification utilities applied separately to each label, with the additional requirement that exactly $k$ labels be predicted for each instance. These "macro-at-$k$" metrics possess desirable properties for extreme classification problems with long-tail labels. Unfortunately, the at-$k$ constraint couples the otherwise independent binary classification tasks, leading to a much more challenging optimization problem than standard macro-averages. We provide a statistical framework to study this problem, prove the existence and the form of the optimal classifier, and propose a statistically consistent and practical learning algorithm based on the Frank-Wolfe method. Interestingly, our main results hold for even more general metrics that are non-linear functions of label-wise confusion matrices. Empirical results provide evidence for the competitive performance of the proposed approach.
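To make the prediction rule concrete, the following is a minimal sketch, not the authors' implementation, of selecting the top $k$ labels after a label-wise affine transform of the marginals, together with macro-precision as one example of a macro-averaged metric. The names `eta` (estimated label marginals), `a` and `b` (per-label gains and offsets), and both function names are assumptions made for illustration. In the paper the transform would be produced by the Frank-Wolfe procedure; here it is simply passed in.

```python
import numpy as np

def predict_top_k(eta, a, b, k):
    """Predict exactly k labels per instance.

    eta : (n_instances, n_labels) array of estimated label marginals.
    a, b: (n_labels,) arrays defining the affine transform a_j * eta_j + b_j
          (hypothetical inputs; how they are obtained is not shown here).
    """
    scores = eta * a + b
    # Indices of the k largest scores in each row (order within the k is arbitrary).
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    y_hat = np.zeros(eta.shape, dtype=int)
    np.put_along_axis(y_hat, top, 1, axis=1)
    return y_hat

def macro_precision(y_true, y_pred):
    """Average of per-label precision; labels never predicted count as 0."""
    tp = (y_true * y_pred).sum(axis=0).astype(float)
    predicted = y_pred.sum(axis=0).astype(float)
    per_label = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    return per_label.mean()

# Toy marginals for 2 instances and 3 labels.
eta = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.6, 0.5]])
plain = predict_top_k(eta, np.ones(3), np.zeros(3), k=2)
# Up-weighting the third (tail) label changes which labels are selected.
reweighted = predict_top_k(eta, np.array([1.0, 1.0, 10.0]), np.zeros(3), k=2)
```

With unit gains the rule is ordinary top-$k$ on the marginals. A larger gain on a rare label lets it displace a more probable head label, which is the kind of trade-off a macro-averaged objective rewards.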
