Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework

Diego Ortego, Marlon Rodríguez, Mario Almagro, Kunal Dahiya, David Jiménez, Juan C. SanMiguel

TL;DR

The paper addresses extreme multi-label classification (XMC) over massive label spaces with two advances: scaling decoder-only large language models (up to 7B parameters) for embedding-based XMC, and introducing ViXML, a vision-enhanced framework that injects visual metadata through an efficient early-fusion approach. It shows that decoder-only models can surpass encoder-based baselines when paired with structured prompting and contrastive learning, and that ViXML adds multi-modal capability at little extra computational cost by using a frozen vision encoder and a single embedding per image. Empirically, the approach achieves state-of-the-art results across four text-only datasets and their image-augmented variants, with P@1 gains of up to 8.21 percentage points; in most cases, ViXML with modestly sized encoders outperforms much larger text-only decoders. The work also provides dataset extensions with visual metadata to support future multi-modal XMC benchmarking, and discusses practical considerations such as latency and prompt design, along with potential directions for scaling and refinement.

Abstract

Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a decoder with a few billion parameters can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoders in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image-enhanced versions validate our proposals' effectiveness, surpassing the previous state of the art by up to +8.21% in P@1 on the largest dataset. ViXML's code is available at https://github.com/DiegoOrtego/vixml.
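To make the retrieval formulation in the abstract concrete, the following is a minimal sketch of XMC posed as maximum inner product search, together with the P@k metric used to report results. It is not the authors' implementation: the function names, the toy two-dimensional embeddings, and the exhaustive scoring loop are all assumptions made for illustration. A real system would obtain embeddings from a trained model and would typically replace the exhaustive loop with an approximate nearest-neighbor index.

```python
from typing import List, Sequence, Set


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product between a query embedding and a label embedding."""
    return sum(a * b for a, b in zip(u, v))


def top_k_labels(query: Sequence[float],
                 label_embs: List[Sequence[float]],
                 k: int) -> List[int]:
    """Return the indices of the k labels with the largest inner product.

    Exhaustive scoring is used here for clarity (an assumption of this sketch).
    """
    scores = [inner_product(query, emb) for emb in label_embs]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def precision_at_k(predicted: List[int], relevant: Set[int], k: int) -> float:
    """P@k: fraction of the top-k predicted labels that are relevant."""
    hits = sum(1 for j in predicted[:k] if j in relevant)
    return hits / k


# Toy data (hypothetical): four labels and one query in a 2-D embedding space.
label_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [0.9, 0.1]
relevant = {0, 1}  # ground-truth label indices for this query

predicted = top_k_labels(query, label_embs, k=2)
p_at_1 = precision_at_k(predicted, relevant, k=1)
p_at_2 = precision_at_k(predicted, relevant, k=2)
print(predicted, p_at_1, p_at_2)
```

In this toy case the top-ranked label is relevant and the second is not, so P@1 is 1.0 and P@2 is 0.5. Note that only the embedding model changes between the encoder, decoder, and multi-modal settings described above; the retrieval step stays the same.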

Paper Structure

This paper contains 31 sections, 4 equations, 2 figures, and 14 tables.

Figures (2)

  • Figure 1: Overview of the ViXML multi-modal framework, which supports both encoder language models (LMs) and decoder LLMs. ViXML incorporates visual metadata in queries ($a_i$) and labels ($a_r$) while keeping the vision encoder frozen for efficiency. Prompts (${\mathcal{E}}'_{i}$ and ${\mathcal{E}}'_{r}$) combine text and projected image embeddings (${\mathcal{E}}$ and ${\mathcal{V}}$). Sentence embeddings (${\mathbf{h}}^{i}_{q}$ and ${\mathbf{h}}^{r}_{l}$) are learned via contrastive learning.
  • Figure 2: Performance (P@1) and Training Time (TT) for dual-encoder (dots) and dual-decoder (stars) learning on LF-AmazonTitles-131K. The ViXML multi-modal framework (blue) improves over text-only alternatives (red), while decoder models outperform encoders in both setups. The previous state of the art (SOTA) is represented by the MOGIC method.
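
The early-fusion prompt and contrastive objective described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's method: the linear projection, the choice to append image vectors after the text tokens, the in-batch InfoNCE-style loss, and the temperature value are all hypothetical, as are the function names `project`, `fuse_prompt`, and `info_nce`. The only details taken from the source are that each image contributes a single embedding from a frozen vision encoder, that projected image embeddings are combined with text embeddings in the prompt, and that sentence embeddings are trained contrastively.

```python
import math
from typing import List

Vector = List[float]


def project(image_emb: Vector, weight: List[Vector]) -> Vector:
    """Map one frozen vision-encoder embedding into the text-embedding space.

    `weight` has one row per output dimension (a plain linear map, assumed).
    """
    return [sum(w * x for w, x in zip(row, image_emb)) for row in weight]


def fuse_prompt(text_embs: List[Vector],
                image_embs: List[Vector],
                weight: List[Vector]) -> List[Vector]:
    """Early fusion: extend the token-embedding sequence with one projected
    vector per image. Appending after the text is an assumption of this sketch.
    """
    return text_embs + [project(v, weight) for v in image_embs]


def info_nce(queries: List[Vector],
             labels: List[Vector],
             temperature: float = 0.05) -> float:
    """In-batch contrastive loss where labels[i] is the positive for queries[i]
    and every other label in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, l)) / temperature for l in labels]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy shapes (hypothetical): 3 text tokens of dim 2, one image embedding of dim 3.
text_embs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
image_embs = [[1.0, 0.0, 2.0]]
weight = [[0.5, 0.0, 0.25], [0.0, 1.0, 0.0]]  # 2 x 3 projection matrix

fused = fuse_prompt(text_embs, image_embs, weight)
print(len(fused), fused[-1])

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Because each image adds only one vector to the sequence, the prompt length grows by the number of images rather than by the number of image patches, which is consistent with the efficiency argument made in the abstract. The loss is lower when each query embedding is closest to its own label embedding, which is the behavior contrastive training aims for.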