Retrieval-augmented Encoders for Extreme Multi-label Text Classification

Yau-Shian Wang, Wei-Cheng Chang, Jyun-Yu Jiang, Jiong Zhang, Hsiang-Fu Yu, S. V. N. Vishwanathan

TL;DR

Extreme multi-label classification with millions of labels benefits from both memorization and generalization. RAE-XMC couples a dual-encoder (DE) with a retrieval-augmented memory: it performs retrieve-then-predict over a joint knowledge memory, and a lightweight predictor maps the retrieved keys to label scores via $\hat{\mathbf{p}} = \text{Softmax}(\mathbf{q}^\top \mathbf{K}^\top / \tau) \mathbf{V}$, where $\mathbf{q}$ is the query embedding, $\mathbf{K}$ and $\mathbf{V}$ are the memory keys and values, and $\tau$ is a temperature. Training uses a decoupled contrastive loss with in-batch negatives; inference retrieves the top-$b$ keys with approximate nearest neighbor (ANN) search and aggregates their values. RAE-XMC achieves state-of-the-art results among DE methods on large LF-XMC benchmarks, with a more than 10x training speedup on LF-AmazonTitles-1.3M. A memory-aggregation parameter $\lambda$ tunes the balance between memorization and generalization, and hence between head-label and tail-label performance, at the cost of additional inference overhead from the knowledge memory. The framework requires no external knowledge sources.

Abstract

Extreme multi-label classification (XMC) seeks to find relevant labels from an extremely large label collection for a given text input. To tackle such a vast label space, current state-of-the-art methods fall into two categories. The one-versus-all (OVA) method learns a separate embedding for each label and excels at memorization (i.e., capturing detailed training signals for accurate head label prediction). In contrast, the dual-encoder (DE) model maps input and label text into a shared embedding space for better generalization (i.e., the capability of predicting tail labels with limited training data), but may fall short at memorization. To achieve both generalization and memorization, existing XMC methods often combine DE and OVA models, which involves complex training pipelines. Inspired by the success of retrieval-augmented language models, we propose Retrieval-augmented Encoders for XMC (RAE-XMC), a novel framework that equips a DE model with retrieval-augmented capability for efficient memorization without additional trainable parameters. During training, RAE-XMC is optimized with a contrastive loss over a knowledge memory that consists of both input instances and labels. During inference, given a test input, RAE-XMC retrieves the top-$K$ keys from the knowledge memory and aggregates the corresponding values as the prediction scores. We showcase the effectiveness and efficiency of RAE-XMC on four public LF-XMC benchmarks. RAE-XMC not only advances the state-of-the-art (SOTA) DE method DEXML, but also achieves more than 10x speedup on the largest dataset, LF-AmazonTitles-1.3M, in the same 8-GPU A100 training environment.
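
The retrieve-then-predict step can be illustrated with a minimal sketch of the scoring rule $\hat{\mathbf{p}} = \text{Softmax}(\mathbf{q}^\top \mathbf{K}^\top / \tau) \mathbf{V}$ from the TL;DR. This is not the authors' implementation. The function name `retrieve_then_predict`, the toy memory, the default `tau`, and the choice to normalize the softmax over only the retrieved keys are all assumptions made for illustration, and a brute-force scan stands in for the ANN search.

```python
import math

def retrieve_then_predict(query, keys, values, top_b=2, tau=0.05):
    """Hypothetical sketch: score labels from the top-b retrieved (key, value) pairs."""
    # Inner-product similarity between the query and every key.
    # (Brute force here; an ANN index would replace this scan at scale.)
    sims = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Indices of the top-b most similar keys.
    top = sorted(range(len(keys)), key=lambda i: sims[i], reverse=True)[:top_b]
    # Temperature-scaled softmax over the retrieved keys only (an assumption).
    # Subtracting the maximum keeps the exponentials numerically stable.
    best = max(sims[i] for i in top)
    exps = [math.exp((sims[i] - best) / tau) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of the retrieved value vectors gives per-label scores.
    scores = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for j, v in enumerate(values[i]):
            scores[j] += w * v
    return scores

# Toy memory: three keys in a 2-d embedding space, three labels.
toy_keys = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
toy_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
toy_scores = retrieve_then_predict([1.0, 0.0], toy_keys, toy_values)
```

In this toy case the query matches the first key exactly, so almost all of the softmax weight falls on that key's value vector. The second key is never retrieved, so its label receives a score of zero.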

Paper Structure

This paper contains 40 sections, 10 equations, 3 figures, 9 tables, 2 algorithms.

Figures (3)

  • Figure 1: The proposed RAE-XMC framework. The knowledge retriever consists of an encoder and a $k$-NN searcher, which retrieves the top $b$ (key, value) pairs from a joint knowledge memory. The keys are embeddings of training instances and label text descriptions, while the values are their corresponding one-hot label vectors. A lightweight predictor then combines the labels based on their scores to generate the final prediction.
  • Figure 2: Model performance versus model training time on two large-scale LF-XMC datasets. The y-axis shows Precision@1 and the x-axis shows training time in hours, measured on 8 NVIDIA A100 GPUs.
  • Figure 3: Analysis of different sources of top-$b$ keys retrieved from the knowledge retriever as well as the relation to the model performance.
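
The joint knowledge memory described in the Figure 1 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code. The helper name `build_knowledge_memory` and its arguments are hypothetical, embeddings are taken as precomputed lists of floats, and a training instance with several relevant labels is given a value vector with a 1 for each of them.

```python
def build_knowledge_memory(instance_embs, instance_labels, label_embs):
    """Hypothetical sketch: stack instance and label entries into one (keys, values) memory."""
    num_labels = len(label_embs)
    keys, values = [], []
    # Training-instance entries: key = instance embedding,
    # value = indicator vector over that instance's relevant labels.
    for emb, labels in zip(instance_embs, instance_labels):
        keys.append(emb)
        values.append([1.0 if j in labels else 0.0 for j in range(num_labels)])
    # Label entries: key = embedding of the label text,
    # value = one-hot vector for that label.
    for j, emb in enumerate(label_embs):
        keys.append(emb)
        values.append([1.0 if i == j else 0.0 for i in range(num_labels)])
    return keys, values

# Toy data: two training instances and three labels in a 2-d embedding space.
mem_keys, mem_values = build_knowledge_memory(
    instance_embs=[[0.9, 0.1], [0.2, 0.8]],
    instance_labels=[{0, 2}, {1}],
    label_embs=[[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
)
```

The memory holds one entry per training instance plus one per label, so in this sketch its size grows with both the training set and the label space.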