
Review of Extreme Multilabel Classification

Arpan Dasgupta, Preeti Lamba, Ankita Kushwaha, Kiran Ravish, Siddhant Katyan, Shrutimoy Das, Pawan Kumar

TL;DR

This survey reviews extreme multi-label classification (XMLC), where the number of labels $L$ is enormous and tail labels have very few training examples. It groups methods into categories (Compressed-Sensing, Linear-Algebra, Tree-Based, One-vs-All, Deep-Learning, LLM-Assisted, and Multi-modal) and highlights core ideas such as embedding the label space, distance-preserving encodings, and label trees for scalable inference. It also summarizes evaluation practice: benchmark datasets and metrics such as $P@k$, $\text{DCG}@k$, $\text{nDCG}@k$, and their propensity-based variants, together with macro metrics that emphasize tail performance. Applications covered range from document tagging to advertising. The analysis underscores that low-rank embeddings enable scalability but handle tail labels poorly, so tail labels call for alternative strategies (e.g., distance-preserving and per-cluster approaches), and that recent deep learning and multimodal methods, including LLM-assisted approaches, show strong potential for tail-label gains and real-world impact. Overall, the paper offers a roadmap of XMLC techniques, benchmarks, and practical considerations to guide future research toward scalable, tail-aware, and multimodal extreme classification systems.
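The ranking metrics named above can be illustrated with a minimal sketch. It assumes the definitions commonly used in XMLC benchmarks: binary relevance, a $1/\log_2(\text{rank}+1)$ discount, and propensity scoring that divides each hit by the label's propensity. The function names and the toy propensity values are hypothetical and are not taken from the paper.

```python
import math

def top_k(scores, k):
    """Indices of the k highest-scoring labels (ties broken by label index)."""
    return sorted(range(len(scores)), key=lambda l: scores[l], reverse=True)[:k]

def precision_at_k(scores, relevant, k):
    """P@k: fraction of the top-k predicted labels that are relevant."""
    return sum(1 for l in top_k(scores, k) if l in relevant) / k

def dcg_at_k(scores, relevant, k):
    """DCG@k: each hit at 1-based rank r contributes 1 / log2(r + 1)."""
    return sum(1.0 / math.log2(rank + 2)
               for rank, l in enumerate(top_k(scores, k)) if l in relevant)

def ndcg_at_k(scores, relevant, k):
    """nDCG@k: DCG@k divided by the best DCG@k achievable for this instance."""
    ideal = sum(1.0 / math.log2(r + 2) for r in range(min(k, len(relevant))))
    return dcg_at_k(scores, relevant, k) / ideal if ideal > 0 else 0.0

def ps_precision_at_k(scores, relevant, propensity, k):
    """Propensity-scored P@k: hits on rare (low-propensity) labels count more."""
    return sum(1.0 / propensity[l] for l in top_k(scores, k) if l in relevant) / k

# Toy instance with 5 labels; labels 0 and 4 are the ground truth.
scores = [0.9, 0.1, 0.8, 0.3, 0.7]
relevant = {0, 4}
propensity = [0.9, 0.5, 0.5, 0.5, 0.1]  # hypothetical values; label 4 is a tail label

p3 = precision_at_k(scores, relevant, 3)
dcg3 = dcg_at_k(scores, relevant, 3)
ndcg3 = ndcg_at_k(scores, relevant, 3)
psp3 = ps_precision_at_k(scores, relevant, propensity, 3)
```

In this toy case the tail label 4 dominates the propensity-scored value, which is the intended effect: a correct prediction on a rarely observed label is rewarded more than one on a head label.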

Abstract

Extreme multi-label classification, or XMLC, is an active area of interest in machine learning. Compared to traditional multi-label classification, the number of labels here is extremely large, hence the name extreme multi-label classification. Classical one-versus-all classification does not scale in this setting because of the large number of labels; the same is true for any other classifier. Embedding labels and features into a lower-dimensional space is a common first step in many XMLC methods. Other issues include the existence of head and tail labels, where tail labels are those that occur in a relatively small number of samples. Tail labels create difficulties during embedding. The area has attracted a wide range of approaches, including bit compression motivated by compressed sensing, tree-based embeddings, deep-learning-based latent-space embeddings (including those using attention weights), linear-algebra-based embeddings such as SVD, clustering, and hashing, to name a few. The community has also developed a useful set of metrics for assessing prediction quality on head and tail labels.
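As a rough illustration of the label-embedding idea mentioned in the abstract, the sketch below compresses a sparse label set into a short code with a random projection and decodes it again. It is a minimal sketch under stated assumptions, not any specific method from the survey: the helper names are hypothetical, the learned regressor that would map features to codes is omitted, and simple correlation decoding stands in for a proper sparse-recovery algorithm.

```python
import random

def make_projection(num_labels, code_dim, seed=0):
    """Random Gaussian projection stored column-wise, one unit-norm column per label."""
    rng = random.Random(seed)
    cols = []
    for _ in range(num_labels):
        col = [rng.gauss(0.0, 1.0) for _ in range(code_dim)]
        norm = sum(v * v for v in col) ** 0.5
        cols.append([v / norm for v in col])
    return cols

def compress(label_set, cols):
    """Code z = A y, where y is the binary label vector given as a set of label ids."""
    z = [0.0] * len(cols[0])
    for l in label_set:
        for i, v in enumerate(cols[l]):
            z[i] += v
    return z

def decode_top_k(z, cols, k):
    """Correlation decoding (an assumption of this sketch): score label l by <a_l, z>
    and return the k best. This step is still linear in the number of labels."""
    scores = [sum(a * b for a, b in zip(col, z)) for col in cols]
    return sorted(range(len(cols)), key=lambda l: scores[l], reverse=True)[:k]

# 50 labels compressed to 16-dimensional codes.
cols = make_projection(num_labels=50, code_dim=16)
code = compress({7}, cols)            # in a full pipeline, a regressor predicts this code
recovered = decode_top_k(code, cols, k=1)
```

For a single active label the decoder always returns that label, because a unit-norm column correlates most strongly with itself. With several active labels the recovery is only approximate, which hints at why embedding-based methods can struggle in practice.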

Paper Structure (30 sections, 54 equations, 14 figures, 4 tables, 10 algorithms)

Figures (14)

  • Figure 1: The tail label distribution of $4$ popular XMLC datasets. The tail labels occur far less frequently than the most frequent (head) labels. Further, this imbalance grows with an increasing number of labels in the dataset.
  • Figure 2: Taxonomy of representative extreme multi-label classification (XMLC) methods. The XMLC methods are broadly classified into 6 classes. Due to the popularity of Transformer/LLM-based embeddings, we keep them as a separate class.
  • Figure 3: General flow of compressed-sensing (CS) based methods for extreme multi-label classification: (i) the original, high-dimensional label vectors are first linearly compressed into a compact code space, producing a reduced output space; (ii) using the training instances, a predictor is learned that maps input features to codes in this reduced space, so only the small set of compressed targets is seen during optimisation; (iii) at test time the model outputs a code for each unseen instance; and (iv) a sparse-recovery or learned decoder reconstructs the full label vector from that code, yielding the final predictions. The three CS stages (compression $\rightarrow$ learning $\rightarrow$ reconstruction) make extreme-scale label prediction tractable, and the figure marks where training data and test data enter the pipeline.
  • Figure 4: An example of the hypercube representation. Each axis represents whether a specific label is present. The entire label space is represented by the set of vertices.
  • Figure 5: Trellis encoding used by LTLS [jasinska2016log]. Vertex 0 is the source, vertices 1–6 form three successive layers of decision points, and vertex 7 is the sink. A classifier is attached to every directed edge; at test time the model chooses, at each layer, either the upper edge (interpreted as bit "1") or the lower edge (bit "0"), so that the complete source-to-sink route is a 3-bit code that uniquely identifies one label among the $2^{3}=8$ possibilities. Example 1, label 7 (111): on the path $0 \!\to\! 1 \!\to\! 3 \!\to\! 6 \!\to\! 7$ the upper edge is taken at all three layers, yielding the binary string 111, i.e. label 7 in a zero-based enumeration. Example 2, label 4 (100): on the path $0 \!\to\! 1 \!\to\! 4 \!\to\! 5 \!\to\! 7$ the upper edge is taken at layer 1 and the lower edge at layers 2 and 3, producing 100, which encodes label 4. Because the number of trellis layers grows only logarithmically with the number of labels $L$, LTLS needs $O(\log L)$ classifiers and prediction time, yet can address every label through its unique path.
  • ...and 9 more figures
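
The path encoding described in the Figure 5 caption can be sketched in a few lines. This is a minimal illustration of the 8-label example only, not the LTLS implementation. The vertex layout is an assumption inferred from the two example paths (upper vertices 1, 3, 6 and lower vertices 2, 4, 5), and the function names are hypothetical. The sketch covers only the mapping between labels and paths, not the scoring of edges.

```python
# (upper_vertex, lower_vertex) for each of the three layers.
# Assumed layout, inferred from the two example paths in the caption.
LAYERS = [(1, 2), (3, 4), (6, 5)]
SOURCE, SINK = 0, 7

def label_to_path(label):
    """Map a label in 0..7 to its source-to-sink path (bit 1 = upper, bit 0 = lower)."""
    bits = format(label, "0{}b".format(len(LAYERS)))
    path = [SOURCE]
    for bit, (upper, lower) in zip(bits, LAYERS):
        path.append(upper if bit == "1" else lower)
    path.append(SINK)
    return path

def path_to_label(path):
    """Invert label_to_path: read one bit per layer from the visited vertices."""
    inner = path[1:-1]  # drop source and sink
    bits = "".join("1" if v == upper else "0"
                   for v, (upper, _lower) in zip(inner, LAYERS))
    return int(bits, 2)

path_7 = label_to_path(7)  # upper edge at every layer
path_4 = label_to_path(4)  # upper edge at layer 1, lower edge at layers 2 and 3
```

Each label corresponds to exactly one path and each path to one label, so three binary decisions are enough to address all eight labels.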