Taming the Sigmoid Bottleneck: Provably Argmaxable Sparse Multi-Label Classification

Andreas Grivas, Antonio Vergari, Adam Lopez

TL;DR

The paper investigates the bottleneck created by sigmoid output layers in large-scale multi-label classification (MLC), showing that a low-rank weight matrix makes many label combinations unargmaxable, i.e. impossible to predict for any input. It introduces a DFT-based output layer that, by constraining the weight matrix to the positive Grassmannian ${\mathsf{Gr}}^{+}_{n,2k+1}$ (matrices whose maximal minors are all positive), guarantees argmaxability for every label set with at most $k$ active labels, and connects this guarantee to $2k$-alternating sign vectors. The authors provide theoretical guarantees, discuss practical issues such as tiny decision regions, and propose a slack-variable extension that preserves argmaxability while enlarging the feasible regions. Empirically, the DFT layer achieves comparable or better F1@k scores with up to 50% fewer trainable parameters and faster convergence across three widely used MLC datasets, while avoiding the unargmaxable outputs that affect standard bottlenecked sigmoid layers (BSLs). The work highlights that guaranteeing argmaxability can improve reliability and robustness in safety-critical and long-tail label scenarios, and suggests directions for extending these guarantees to other output-layer families.

Abstract

Sigmoid output layers are widely used in multi-label classification (MLC) tasks, in which multiple labels can be assigned to any input. In many practical MLC tasks, the number of possible labels is in the thousands, often exceeding the number of input features and resulting in a low-rank output layer. In multi-class classification, it is known that such a low-rank output layer is a bottleneck that can result in unargmaxable classes: classes which cannot be predicted for any input. In this paper, we show that for MLC tasks, the analogous sigmoid bottleneck results in exponentially many unargmaxable label combinations. We explain how to detect these unargmaxable outputs and demonstrate their presence in three widely used MLC datasets. We then show that they can be prevented in practice by introducing a Discrete Fourier Transform (DFT) output layer, which guarantees that all sparse label combinations with up to $k$ active labels are argmaxable. Our DFT layer trains faster and is more parameter efficient, matching the F1@k score of a sigmoid layer while using up to 50% fewer trainable parameters. Our code is publicly available at https://github.com/andreasgrv/sigmoid-bottleneck.
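To make the notion of an unargmaxable label combination concrete, the following minimal sketch enumerates exactly which label combinations a tiny bias-free sigmoid layer with $d=2$ features can output. It is an illustration only, not the detection procedure used in the paper: the function name `argmaxable_2d` and the example weights are assumptions, and the trick of walking around the unit circle only works for two features.

```python
import math
from itertools import product


def argmaxable_2d(W):
    """Enumerate the label combinations reachable by a bias-free layer with 2 features.

    W is a list of (w1, w2) rows, one per label. Label i is active when w_i . x > 0.
    Rows are assumed to be non-zero and pairwise non-parallel.
    """
    # Each row defines a line through the origin; its decision boundary meets
    # the unit circle at the two angles perpendicular to the row.
    boundaries = []
    for w1, w2 in W:
        phi = math.atan2(w2, w1)
        boundaries.append((phi + math.pi / 2) % (2 * math.pi))
        boundaries.append((phi - math.pi / 2) % (2 * math.pi))
    boundaries = sorted(set(boundaries))

    # The sign pattern is constant on each arc between consecutive boundaries,
    # so evaluating one point per arc (its midpoint) covers every region.
    patterns = set()
    nxt = boundaries[1:] + [boundaries[0] + 2 * math.pi]
    for a, b in zip(boundaries, nxt):
        theta = (a + b) / 2
        x1, x2 = math.cos(theta), math.sin(theta)
        patterns.add(tuple(int(w1 * x1 + w2 * x2 > 0) for w1, w2 in W))
    return patterns


# Hypothetical weights: 3 labels, 2 features.
W = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
reachable = argmaxable_2d(W)
missing = sorted(set(product((0, 1), repeat=3)) - reachable)
print(len(reachable), missing)  # 6 [(0, 0, 1), (1, 1, 0)]
```

With these weights the third label fires when $x_1 + x_2 > 0$, so it cannot be off while both other labels are on, nor on while both are off. Rotating the hyperplanes changes which two combinations are lost, but not how many.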

Paper Structure (71 sections, 7 theorems, 31 equations, 15 figures, 5 tables)


Key Result

Theorem 1

(Cover, 1965) If $\mathbf{W} \in \mathbb{R}^{n \times d}$ is in general position, the number of argmaxable label combinations is:

$$2 \sum_{i=0}^{d-1} \binom{n-1}{i}$$
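Cover's counting result for $n$ hyperplanes through the origin in general position in $\mathbb{R}^d$ is commonly written as $2\sum_{i=0}^{d-1}\binom{n-1}{i}$. The sketch below evaluates that expression and the resulting fraction of the $2^n$ label combinations; the function names are illustrative and the layer is assumed to have no bias term.

```python
from math import comb


def cover_count(n, d):
    """Argmaxable label combinations for n labels and d features (general position, no bias)."""
    return 2 * sum(comb(n - 1, i) for i in range(d))


def argmaxable_fraction(n, d):
    """Fraction of all 2**n label combinations that are argmaxable."""
    return cover_count(n, d) / 2 ** n


print(cover_count(3, 2))               # 6
print(argmaxable_fraction(1000, 1000)) # 1.0
print(argmaxable_fraction(1000, 500))  # 0.5
print(argmaxable_fraction(1000, 50))   # a vanishingly small fraction
```

The first call reproduces the 6 out of 8 combinations of the $n=3$, $d=2$ example. For $n=1000$ labels the fraction is already one half at $d=500$ and falls off rapidly below that.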

Figures (15)

  • Figure 1: When we have more labels ($n$) than features ($d$), some label combinations are unargmaxable, i.e. impossible to predict. Left: in a $d=2$ feature space with $n=3$ classification hyperplanes through the origin, only 6 out of 8 label combinations can be predicted, irrespective of how we orient the hyperplanes. Right: a BSL trained on the MIMIC-III clinical MLC dataset with $d=50$ and $n=8921$ is unable to predict the depicted label combination, in which $7$ labels are active $(+)$ and the remaining ones are inactive $(-)$.
  • Figure 2: We log-plot what percentage of the $2^{1000}$ label combinations is argmaxable for a BSL with $n=1000$ labels as we decrease the feature dimensionality $d$ (right to left). When $d \ll n$ we can represent exponentially fewer outputs. We split the y-axis to highlight the fast dip when $d<500$.
  • Figure 3: Our $n=3,d=2$ example from Figure 1. We include the balls found by the Chebyshev LP for each argmaxable label combination. When $d \ll n$, most balls will have a tiny radius.
  • Figure 4: Visual evidence of \ref{thm:altfeas}. (a) We construct a BSL having $n=4$ labels and $d=2$ features parametrised by $\mathbf{W} \in \mathbb{R}^{4 \times 2}$ such that all maximal minors are positive, i.e. $\mathbf{W} \in {\mathsf{Gr}}^{+}_{n=4, d=2}$. (b) The rows of the matrix are binary classifiers; we demarcate the decision boundary of each classifier using a dashed line. (c) We assign each region a sign vector corresponding to which labels the BSL would flag as active for an input falling in that region. As per \ref{thm:altfeas}, exactly the $(d-1) = 1$-alternating outputs are argmaxable. More generally, for $d=2k+1$, all $k$-active outputs are argmaxable (see \ref{app:3d}).
  • Figure 5: Left: As we increase the number of labels $n$ for the DFT Layer, the radii of the regions shrink, making them harder to predict in practice. Right: Adding slack variables ameliorates this problem. We plot $\epsilon$-argmaxability (\ref{def:eargmax}), measured here for the 1% of labels that have radius less than that plotted. For the DFT Layer, i.e. $\mathbf{W}=\bm{W}^{\mathsf{DFT}}_{n, 2k+1}$, all $k$-active label assignments are argmaxable, but as we increase $n$, some (see $k \geq 3$) cannot be detected at the precision of the LP ($10^{-8}$). Adding 16 randomly initialised slack columns, i.e. $\mathbf{W} = \left[\bm{W}^{\mathsf{DFT}}_{n, 2k+1}\;\;\mathbf{S}\right]$, makes the regions $\epsilon$-argmaxable with larger $\epsilon$.
  • ...and 10 more figures
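
The claim that $d=2k+1$ features suffice to make every $k$-active output argmaxable can be checked by hand on a small case. The sketch below is an illustrative construction, not the paper's parameterisation or training procedure: it assumes weight rows of the form $(1, \cos t_i, \sin t_i, \dots, \cos k t_i, \sin k t_i)$ at $t_i = 2\pi i/n$, and the helper names `dft_rows`, `witness` and `predict` are hypothetical. For an active set $S$ it builds the trigonometric polynomial $p(t) = \epsilon - \prod_{s \in S} (1 - \cos(t - t_s))$, which has degree at most $k$, is positive at the active labels and negative at all others, and reads off its Fourier coefficients as the feature vector.

```python
import math
from itertools import combinations


def dft_rows(n, k):
    """Assumed weight matrix: row i is (1, cos t, sin t, ..., cos kt, sin kt) at t = 2*pi*i/n."""
    rows = []
    for i in range(n):
        t = 2 * math.pi * i / n
        row = [1.0]
        for j in range(1, k + 1):
            row += [math.cos(j * t), math.sin(j * t)]
        rows.append(row)
    return rows


def witness(n, k, active):
    """Return (x, eps): a feature vector whose logits are positive exactly on `active`.

    Requires len(active) <= k < n. eps is the margin of the construction.
    """
    ts = [2 * math.pi * i / n for i in range(n)]

    def bump(t):
        # Zero at every active label, strictly positive at every other grid point.
        return math.prod(1 - math.cos(t - ts[s]) for s in active)

    eps = 0.5 * min(bump(ts[i]) for i in range(n) if i not in active)

    def p(t):
        return eps - bump(t)

    # p has degree <= k, so 2k+1 equally spaced samples recover its
    # Fourier coefficients exactly (up to floating-point rounding).
    N = 2 * k + 1
    thetas = [2 * math.pi * m / N for m in range(N)]
    x = [sum(p(th) for th in thetas) / N]
    for j in range(1, k + 1):
        x.append(2 / N * sum(p(th) * math.cos(j * th) for th in thetas))
        x.append(2 / N * sum(p(th) * math.sin(j * th) for th in thetas))
    return x, eps


def predict(W, x):
    """Indices of labels whose logit w_i . x is positive."""
    return {i for i, row in enumerate(W) if sum(w * v for w, v in zip(row, x)) > 0}


n, k = 12, 2
W = dft_rows(n, k)
ok = all(
    predict(W, witness(n, k, S)[0]) == set(S)
    for r in range(k + 1)
    for S in combinations(range(n), r)
)
print(ok)  # True
```

The margin `eps` returned by `witness` shrinks as $n$ grows, because neighbouring grid points move closer together. This is a rough analogue of the shrinking region radii described in the Figure 5 caption, although it is not the Chebyshev radius itself.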

Theorems & Definitions (20)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Lemma 1
  • Theorem 2
  • ...and 10 more