Extreme Multi-label Completion for Semantic Document Labelling with Taxonomy-Aware Parallel Learning

Julien Audiffren, Christophe Broillet, Ljiljana Dolamic, Philippe Cudré-Mauroux

TL;DR

TAMLEC tackles extreme multi-label completion by embedding taxonomy structure into Taxonomy-Aware Tasks and predicting label paths with a Transformer-based architecture. It introduces a Weak-Semilattice formulation to accommodate multi-parent taxonomies and defines a TAT-based loss that balances cross-task sharing. Empirical results on MAG-CS, PubMed, and EURLex show TAMLEC outperforms state-of-the-art XMLCo methods, with strong performance in few-shot scenarios. The approach offers practical advantages for large-scale, taxonomy-rich labeling and rapid adaptation to new tasks and labels.

Abstract

In Extreme Multi-Label Completion (XMLCo), the objective is to predict the missing labels of a collection of documents. Together with XML Classification, XMLCo is arguably one of the most challenging document classification tasks, as the number of labels (at least tens of thousands) is generally very large compared to the number of labelled documents available in the training dataset. Such a task is often accompanied by a taxonomy that encodes the labels' organic relationships, and many methods have been proposed to leverage this hierarchy to improve the results of XMLCo algorithms. In this paper, we propose a new approach to this problem, TAMLEC (Taxonomy-Aware Multi-task Learning for Extreme multi-label Completion). TAMLEC divides the problem into several Taxonomy-Aware Tasks, i.e. subsets of labels adapted to the hierarchical paths of the taxonomy, and trains on these tasks using a dynamic Parallel Feature sharing approach, where some parts of the model are shared between tasks while others are task-specific. Then, at inference time, TAMLEC uses the labels available in a document to infer the appropriate tasks and to predict missing labels. To achieve this result, TAMLEC uses a modified transformer architecture that predicts ordered sequences of labels on a Weak-Semilattice structure that is naturally induced by the tasks. This approach yields multiple advantages. First, our experiments on real-world datasets show that TAMLEC outperforms state-of-the-art methods for various XMLCo problems. Second, TAMLEC is by construction particularly suited for few-shot XML tasks, where new tasks or labels are introduced with only a few examples, and extensive evaluations highlight its strong performance compared to existing methods.
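The task-inference step described in the abstract can be illustrated with a small sketch. It assumes that each Taxonomy-Aware Task is simply a named subset of labels and that a task is selected whenever it shares at least one label with the document's known labels. That overlap rule, the task names, and the label names are hypothetical choices made for illustration, not the selection rule used by TAMLEC, and the sketch only narrows the candidate set without scoring any label.

```python
from typing import Dict, Set


def infer_tasks(known_labels: Set[str], tasks: Dict[str, Set[str]]) -> Set[str]:
    """Return the names of the tasks whose label subset overlaps the known labels.

    Assumption: overlap is used here as a stand-in for task inference.
    """
    return {name for name, labels in tasks.items() if labels & known_labels}


def candidate_labels(known_labels: Set[str], tasks: Dict[str, Set[str]]) -> Set[str]:
    """Collect the labels of every inferred task that the document does not yet have."""
    candidates: Set[str] = set()
    for name in infer_tasks(known_labels, tasks):
        candidates |= tasks[name]
    return candidates - known_labels


# Hypothetical toy tasks: each one is a subset of labels along taxonomy paths.
TASKS: Dict[str, Set[str]] = {
    "machine_learning": {"Machine Learning", "Deep Learning", "LLMs"},
    "nlp": {"Natural Language Processing", "LLMs"},
    "databases": {"Databases", "Query Optimization"},
}

if __name__ == "__main__":
    known = {"LLMs"}
    print(sorted(infer_tasks(known, TASKS)))
    print(sorted(candidate_labels(known, TASKS)))
```

In this toy setting a document labelled only with "LLMs" activates two tasks, so only their labels are considered as completion candidates, while the labels of the unrelated task are never examined.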

Paper Structure

This paper contains 14 sections, 1 theorem, 5 equations, 3 figures, 3 tables.

Key Result

lemma 1

Let $(T,\leq)$ be a Poset. Then $(T,\leq)$ has a Condorcet winner if and only if $(T,\leq)$ is a Weak-Semilattice.

Figures (3)

  • Figure 1: Example of a toy taxonomy of scientific labels. An arrow from $\ell_1$ to $\ell_2$ represents $\ell_1 \leq \ell_2$. This taxonomy can be represented with a Weak-Semilattice, and not with a tree, as the label "LLMs" has multiple parents.
  • Figure 2: Example of a Weak-Semilattice taxonomy with a Taxonomy-Aware Tasks decomposition, represented by the different polygons.
  • Figure 3: TAMLEC's architecture. The model is made of 6 Encoders and 6 Decoders with weights shared across tasks, as well as one Task-Specific Generator per task, whose weights are task-specific.
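The point made in the caption of Figure 1, that a label with several parents rules out a tree representation, can be checked on a toy example. The sketch below is only an illustration: apart from "LLMs", the label names are hypothetical, and it assumes that each arrow goes from a parent to its child, so that the parent is the smaller element of the order.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy taxonomy: each pair (l1, l2) is an arrow meaning l1 <= l2.
# Assumption: arrows go from parent to child.
EDGES: List[Tuple[str, str]] = [
    ("Computer Science", "Machine Learning"),
    ("Computer Science", "Natural Language Processing"),
    ("Machine Learning", "LLMs"),
    ("Natural Language Processing", "LLMs"),
]


def build_parents(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map every label to the set of its direct parents."""
    parents: Dict[str, Set[str]] = {}
    for parent, child in edges:
        parents.setdefault(parent, set())
        parents.setdefault(child, set()).add(parent)
    return parents


def multi_parent_labels(parents: Dict[str, Set[str]]) -> List[str]:
    """Labels with more than one direct parent; any such label rules out a tree."""
    return sorted(label for label, ps in parents.items() if len(ps) > 1)


def leq(a: str, b: str, parents: Dict[str, Set[str]]) -> bool:
    """True if a <= b, i.e. a equals b or a is an ancestor of b."""
    stack, seen = [b], set()
    while stack:
        node = stack.pop()
        if node == a:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


PARENTS = build_parents(EDGES)

if __name__ == "__main__":
    print(multi_parent_labels(PARENTS))
    print(leq("Computer Science", "LLMs", PARENTS))
    print(leq("Machine Learning", "Natural Language Processing", PARENTS))
```

Storing the direct parents of each label keeps the order relation cheap to query by walking upward, and it makes incomparable pairs, such as two sibling labels, explicit.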

Theorems & Definitions (6)

  • definition 1: Partially Ordered Set
  • definition 2: Weak-Semilattice
  • lemma 1
  • definition 3: Children in a Weak-Semilattice
  • definition 4: Width of a Weak-Semilattice
  • definition 5: Taxonomy-Aware Tasks