MatchXML: An Efficient Text-label Matching Framework for Extreme Multi-label Text Classification

Hui Ye, Rajshekhar Sunderraman, Shihao Ji

Abstract

eXtreme Multi-label text Classification (XMC) refers to training a classifier that assigns relevant labels to a text sample from an extremely large label set (e.g., millions of labels). We propose MatchXML, an efficient text-label matching framework for XMC. We observe that label embeddings generated from sparse Term Frequency-Inverse Document Frequency (TF-IDF) features have several limitations. We therefore propose label2vec, which trains semantic dense label embeddings with the Skip-gram model. The dense label embeddings are then used to build a Hierarchical Label Tree (HLT) by clustering. When fine-tuning the pre-trained encoder Transformer, we formulate multi-label text classification as a text-label matching problem in a bipartite graph. We then extract dense text representations from the fine-tuned Transformer. In addition to these fine-tuned dense text embeddings, we extract static dense sentence embeddings from a pre-trained Sentence Transformer. Finally, a linear ranker is trained on the sparse TF-IDF features, the fine-tuned dense text representations, and the static dense sentence features. Experimental results demonstrate that MatchXML achieves state-of-the-art accuracy on five out of six datasets. In terms of speed, MatchXML outperforms the competing methods on all six datasets. Our source code is publicly available at https://github.com/huiyegit/MatchXML.
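
To make the label2vec idea more concrete, here is a minimal sketch of how Skip-gram training pairs could be drawn from label sets. It assumes that each training sample's label set is treated as an unordered "sentence" and that every other label in the same set counts as context; the function name `skipgram_pairs` and the toy labels are hypothetical and are not taken from the MatchXML source code.

```python
from itertools import permutations


def skipgram_pairs(label_sets):
    """Build (target, context) label pairs for Skip-gram training.

    Assumption: a label set has no word order, so the context window
    covers the whole set and every ordered pair of distinct labels
    from the same sample becomes one training pair.
    """
    pairs = []
    for labels in label_sets:
        unique = sorted(set(labels))  # drop duplicates, fix the order
        pairs.extend(permutations(unique, 2))
    return pairs


# Hypothetical toy data: one label set per training sample.
label_sets = [
    ["camera", "tripod", "lens"],
    ["lens", "filter"],
    ["novel"],  # a single-label sample yields no pairs
]
pairs = skipgram_pairs(label_sets)
```

Pairs like these could then be passed to any Skip-gram trainer to obtain dense label vectors, which is the role the abstract assigns to label2vec before the clustering step.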

Paper Structure

This paper contains 14 sections, 5 equations, 3 figures, 12 tables, 1 algorithm.

Figures (3)

  • Figure 1: Architecture of text-label matching in a bipartite graph. When fine-tuning a pre-trained encoder Transformer for the $t$-th layer of the HLT, we treat the input text set $U$ (i.e., training samples) as the text modality and the label set $V^{(t)}$ (i.e., training labels $Y^{(t)}$) as the label modality.
  • Figure 2: Label distributions of Amazon-670K and Amazon-3M follow a power law (Zipf's law), as shown in (a) and (b). Text distributions of Amazon-670K and Amazon-3M do not follow a particular standard form, as shown in (c) and (d).
  • Figure : MatchXML Training
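
The power-law shape described in the Figure 2 caption can be checked on any label collection by fitting a line to the log-log rank-frequency curve. The sketch below is illustrative only: the helper `rank_frequency_slope` and the synthetic data are hypothetical and do not reproduce the Amazon-670K or Amazon-3M statistics. A slope near -1 is the classic Zipf pattern.

```python
import math
from collections import Counter


def rank_frequency_slope(label_sets):
    """Least-squares slope of log(frequency) against log(rank).

    Labels are ranked from most to least frequent. A slope close to -1
    suggests a Zipf-like distribution.
    """
    counts = Counter(label for labels in label_sets for label in labels)
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct labels")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic data (an assumption for illustration): the k-th label
# appears 120 // k times, so frequency is proportional to 1 / rank.
synthetic = [[f"label_{k}"] for k in range(1, 7) for _ in range(120 // k)]
slope = rank_frequency_slope(synthetic)
```

On real data the fit is usually noisier, so the slope is best read together with the plot and not as a standalone test.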