Dual-Encoders for Extreme Multi-Label Classification

Nilesh Gupta; Devvrit Khatri; Ankit S Rawat; Srinadh Bhojanapalli; Prateek Jain; Inderjit Dhillon

TL;DR

This work demonstrates that dual-encoder models can achieve state-of-the-art or competitive performance on extreme multi-label classification tasks when equipped with a loss designed for multi-label, many-shot settings. The Decoupled Softmax loss and SoftTop-k variant, together with a memory-efficient gradient-cache training pipeline, enable effective learning over millions of labels without per-label classifiers. Empirical results on multiple large XMC benchmarks show substantial improvements in top-k accuracy with far fewer trainable parameters compared to existing methods, and ablations highlight the importance of negative sampling strategy. The findings suggest a unified, parameter-efficient approach to retrieval and XMC, with practical implications for scalable search and recommendation systems.

Abstract

Dual-encoder (DE) models are widely used in retrieval tasks, most commonly studied on open QA benchmarks that are often characterized by multi-class and limited training data. In contrast, their performance in multi-label and data-rich retrieval settings like extreme multi-label classification (XMC) remains under-explored. Current empirical evidence indicates that DE models fall significantly short on XMC benchmarks, where SOTA methods linearly scale the number of learnable parameters with the total number of classes (documents in the corpus) by employing a per-class classification head. To this end, we first study and highlight that existing multi-label contrastive training losses are not appropriate for training DE models on XMC tasks. We propose the decoupled softmax loss, a simple modification to the InfoNCE loss, that overcomes the limitations of existing contrastive losses. We further extend our loss design to a soft top-k operator-based loss which is tailored to optimize top-k prediction performance. When trained with our proposed loss functions, standard DE models alone can match or outperform SOTA methods by up to 2% at Precision@1 even on the largest XMC datasets while being 20x smaller in terms of the number of trainable parameters. This leads to more parameter-efficient and universally applicable solutions for retrieval tasks. Our code and models are publicly available at https://github.com/nilesh2797/dexml.
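To make the "simple modification to the InfoNCE loss" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the decoupling means that each positive label is normalized only against itself and the negatives, so other positives of the same query are excluded from its softmax denominator. The function names, the `scores` matrix of query-label similarities, and the boolean `pos_mask` are all hypothetical names chosen for illustration.

```python
import numpy as np

def softmax_loss(scores, pos_mask):
    """Multi-label InfoNCE-style loss: every positive competes with ALL labels,
    including the other positives of the same query."""
    log_z = np.logaddexp.reduce(scores, axis=1)           # (B,) log-partition over all labels
    log_p = scores - log_z[:, None]                       # (B, L) log-probabilities
    return -(log_p * pos_mask).sum() / pos_mask.sum()

def decoupled_softmax_loss(scores, pos_mask):
    """Assumed decoupled variant: each positive competes only with the negatives."""
    neg_scores = np.where(pos_mask, -np.inf, scores)      # mask out positives
    log_neg = np.logaddexp.reduce(neg_scores, axis=1)     # (B,) log-sum-exp over negatives only
    # Per-label denominator: log(exp(s_p) + sum over negatives of exp(s_n))
    log_denom = np.logaddexp(scores, log_neg[:, None])    # (B, L)
    log_p = scores - log_denom
    return -(log_p * pos_mask).sum() / pos_mask.sum()

# One query with two equally well-scored positives and one poorly scored negative.
scores = np.array([[10.0, 10.0, -10.0]])
pos_mask = np.array([[True, True, False]])
standard = softmax_loss(scores, pos_mask)
decoupled = decoupled_softmax_loss(scores, pos_mask)
```

In this toy case the standard loss cannot drop below roughly log 2, because the two positives split the probability mass between them, while the decoupled loss approaches zero once both positives outscore the negative. With a single positive per query the two losses coincide.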

Paper Structure (37 sections, 12 equations, 8 figures, 16 tables)

Introduction
Related work
Background: multi-label classification
Improved training of dual-encoder models
Limitations of standard contrastive loss functions for extreme multi-label problems
Memory Efficient Training
Proposed differentiable top-k operator-based loss
Experiments
Datasets and evaluation
Baselines and setup
Comparison with XMC methods
Decoupled Softmax with Hard Negatives
Comparison across different loss functions
Conclusions & Limitations
Future Work and Reproducibility
...and 22 more sections

Figures (8)

Figure 1: Number of trainable parameters used by different models and their Precision@1 performance on the LF-AmazonTitles-1.3M dataset [Bhatia16].
Figure 2: Decoupled softmax vs standard softmax on synthetic dataset.
Figure 3: Gradient analysis of two labels, [left] "afghanistan" (cherry-picked) and [right] "data_transmission" (randomly chosen), on the EURLex-4K dataset for all positive training queries encountered during training.
Figure 4: Illustration of the distributed implementation with gradient caching applied to the label embedding computation. Here, solid black lines indicate the forward pass direction and solid red lines indicate the gradient backpropagation direction. $L$ is the number of labels considered in the loss computation, $B$ is the batch size of queries, and $\eta$ is a micro-batch-size hyperparameter that controls how many labels are processed at a time.
Figure 5: PR curve for Decoupled softmax vs standard softmax on the EURLex-4K dataset.
...and 3 more figures
