Scalable Label Distribution Learning for Multi-Label Classification

Xingyu Zhao, Yuexuan An, Lei Qi, Xin Geng

TL;DR

SLDL addresses scalable multi-label classification by embedding labels as low-dimensional Gaussian distributions in a latent space, capturing asymmetric label correlations through a probability transfer matrix. It learns a feature-to-embedding mapping with $\mathcal{L}(\boldsymbol{W}) = \| \boldsymbol{Z} - \boldsymbol{X}\boldsymbol{W} \|_F^2 + \alpha \|\boldsymbol{W}\|_F^2$ optimized via L-BFGS, and decodes predictions with a cosine-based nearest-neighbor mechanism that weights neighboring embeddings. The approach provides a theoretical bound linking embedding and regression errors to the final cost and demonstrates strong empirical performance across 15 large-scale MLC benchmarks, achieving both high accuracy and substantial speedups. By decoupling complexity from the number of labels and exploiting asymmetric label relations, SLDL offers a scalable and effective framework for real-world large-output-space MLC tasks.
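
The regression objective above is smooth and, for $\alpha > 0$, strictly convex, so a quasi-Newton solver such as L-BFGS needs only the loss value and its gradient. As a worked step using standard matrix calculus (not quoted from the paper, and assuming $\boldsymbol{X} \in \mathbb{R}^{n \times d}$, $\boldsymbol{Z} \in \mathbb{R}^{n \times m}$, $\boldsymbol{W} \in \mathbb{R}^{d \times m}$, where the dimension names $n$, $d$, $m$ are introduced here for illustration):

$$
\nabla_{\boldsymbol{W}} \mathcal{L}(\boldsymbol{W}) = -2\,\boldsymbol{X}^{\top}\left(\boldsymbol{Z} - \boldsymbol{X}\boldsymbol{W}\right) + 2\alpha \boldsymbol{W},
\qquad
\nabla_{\boldsymbol{W}} \mathcal{L} = \boldsymbol{0} \;\Longrightarrow\; \boldsymbol{W}^{\star} = \left(\boldsymbol{X}^{\top}\boldsymbol{X} + \alpha \boldsymbol{I}\right)^{-1}\boldsymbol{X}^{\top}\boldsymbol{Z}.
$$

Neither expression involves the number of labels; the output dimension enters only through the $m$ columns of $\boldsymbol{Z}$, which is consistent with the claim that the cost of this stage is decoupled from the label count.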

Abstract

Multi-label classification (MLC) refers to the problem of tagging a given instance with a set of relevant labels. Most existing MLC methods assume that the correlation between the two labels in each label pair is symmetric, an assumption that is violated in many real-world scenarios. Moreover, most existing methods design learning processes tied to the number of labels, which makes their computational complexity a bottleneck when scaling up to large output spaces. To tackle these issues, we propose a novel method named Scalable Label Distribution Learning (SLDL) for multi-label classification, which describes different labels as distributions in a latent space where the label correlation is asymmetric and the dimension is independent of the number of labels. Specifically, SLDL first converts labels into continuous distributions within a low-dimensional latent space and leverages an asymmetric metric to establish the correlation between different labels. Then, it learns the mapping from the feature space to the latent space, so that the computational complexity no longer depends on the number of labels. Finally, SLDL leverages a nearest-neighbor-based strategy to decode the latent representations and obtain the final predictions. Extensive experiments show that SLDL achieves very competitive classification performance at low computational cost.
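
The abstract does not spell out the decoding rule, so the following is only a minimal sketch of one way a nearest-neighbor decoder could work. It assumes that label scores are a cosine-similarity-weighted average of the label vectors of the `k` training embeddings closest to the predicted embedding; the function names `cosine` and `knn_decode`, the parameter `k`, and the clipping of negative similarities are all hypothetical choices, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def knn_decode(z_hat, train_embeddings, train_labels, k=2):
    """Score each label by a cosine-weighted vote over the k nearest embeddings."""
    sims = [cosine(z_hat, z) for z in train_embeddings]
    # Indices of the k training embeddings most similar to the prediction.
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    scores = [0.0] * len(train_labels[0])
    total = 0.0
    for i in nearest:
        w = max(sims[i], 0.0)  # assumption: neighbours with negative similarity get no vote
        total += w
        for j, y in enumerate(train_labels[i]):
            scores[j] += w * y
    # Normalise so that scores lie in [0, 1]; leave zeros if no neighbour voted.
    return [s / total for s in scores] if total > 0 else scores


# Toy data: three training embeddings (2-D) with three-label binary vectors.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
scores = knn_decode([1.0, 0.1], embeddings, labels, k=2)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(ranking)  # label indices ordered from most to least relevant
```

The cost of this step depends on the number of training embeddings and the latent dimension when searching for neighbours; the label vectors are only touched for the `k` selected neighbours.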

Paper Structure

This paper contains 22 sections, 1 theorem, 27 equations, 6 figures, 12 tables, 3 algorithms.

Key Result

Theorem 1

For any sample $\left(\boldsymbol{x},\boldsymbol{y}\right)$, let $\boldsymbol{z}$ be the embedding vector of $\boldsymbol{y}$, $\boldsymbol{\hat{z}}$ be the predicted embedding vector, and $\boldsymbol{\tilde{z}}$ be the embedding vector nearest to $\boldsymbol{\hat{z}}$. Then the following bound holds: … where $\mathcal{E}\left(\cdot, \cdot\right)$ denotes the Euclidean distance and $b > 1$ is a constant.

Figures (6)

  • Figure 1: An illustration of exemplar images and their corresponding labels. The correlation between "sky" and "cloud" is asymmetric: if "cloud" appears in an instance, then "sky" also appears, whereas if "sky" appears, "cloud" does not necessarily appear.
  • Figure 2: The schematic diagram of SLDL. The whole process of SLDL can be divided into three stages: (1) target embedding: transform the label vectors into low-dimensional embedding vectors, where the dimension of the target space is reduced and the asymmetric label correlations are constructed; (2) feature mapping: learn a mapping function from feature vectors to embedding vectors, where the computational complexity is no longer related to the number of labels; (3) decoding: map the target embedding vector to the predicted label vector.
  • Figure 3: Comparison of SLDL (control algorithm) against the compared algorithms with the Nemenyi test. Algorithms not connected with SLDL in the CD diagram are considered to perform significantly differently from the control algorithm.
  • Figure 4: Training time of AdaC2, SLEEC, FLEM, KD-TEA, and SLDL on different datasets. In each subfigure, the x-axis indicates the MLC method and the y-axis indicates the training time (s).
  • Figure 5: Effects of $\hat{c}$ on P@$1$ and P@$5$. In each subfigure, the x-axis indicates the value of $\hat{c}$ and the y-axis indicates the values of P@$1$ and P@$5$.
  • ...and 1 more figure
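
The "sky"/"cloud" asymmetry described in the Figure 1 caption can be made concrete with conditional co-occurrence frequencies. The sketch below is illustrative only: the toy label sets and the helper `conditional_cooccurrence` are hypothetical, and this page does not define the paper's probability transfer matrix, so the estimator here should not be read as the authors' construction.

```python
def conditional_cooccurrence(label_rows, labels):
    """Return P[a][b] = (# instances containing a and b) / (# instances containing a)."""
    counts = {a: 0 for a in labels}
    joint = {a: {b: 0 for b in labels} for a in labels}
    for row in label_rows:
        present = [name for name in labels if name in row]
        for a in present:
            counts[a] += 1
            for b in present:
                joint[a][b] += 1
    return {
        a: {b: (joint[a][b] / counts[a] if counts[a] else 0.0) for b in labels}
        for a in labels
    }


# Hypothetical toy annotations: every image with "cloud" also has "sky",
# but only half of the "sky" images contain "cloud".
rows = [{"sky", "cloud"}, {"sky"}, {"sky", "cloud"}, {"sky", "sea"}]
P = conditional_cooccurrence(rows, ["sky", "cloud", "sea"])

print(P["cloud"]["sky"])  # 1.0
print(P["sky"]["cloud"])  # 0.5
```

Because `P["cloud"]["sky"]` and `P["sky"]["cloud"]` differ, a symmetric similarity between the two labels would discard information that a directed measure retains.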

Theorems & Definitions (2)

  • Theorem 1
  • proof