Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders

Francisco J. Ribadas-Pena; Shuyuan Cao; Víctor M. Darriba Bilbao

TL;DR

The paper tackles large-scale, extreme multi-label (XML) text categorization for biomedical indexing. A label autoencoder (label-AE) embeds MeSH descriptors into a latent space, which enables a latent-space $k$-NN classifier whose decoder reconstructs the final label set. The paper compares sparse (BM25/Lucene) and dense (SPECTER-based) document representations for neighbor retrieval and evaluates several label-AE topologies (small, medium, large) with different thresholds and weighting schemes. On a large MEDLINE corpus, sparse representations outperform dense ones, and the medium label-AE configuration delivers the best micro-averaged F-measure (MiF) while maintaining competitive precision and recall. The approach does not surpass state-of-the-art BioASQ results, however, and mixing label-AE predictions with baseline $k$-NN can improve the overall F-score at the expense of ranking metrics. The work demonstrates the feasibility of latent-space label embeddings for semantic indexing and suggests promising directions for multilingual biomedical vocabularies such as DeCS, extending XML solutions beyond MeSH.

Abstract

In this paper, we introduce a multi-label lazy learning approach to automatic semantic indexing in large document collections in the presence of complex, structured label vocabularies with high inter-label correlation. The proposed method is an evolution of the traditional $k$-Nearest Neighbors algorithm: it uses a large autoencoder trained to map the large label space to a reduced-size latent space and to regenerate the predicted labels from this latent space. We have evaluated our proposal on a large portion of the MEDLINE biomedical document collection, which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments we propose and evaluate several document representation approaches and different label autoencoder configurations.
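The idea of running $k$-NN through a label autoencoder can be illustrated with a minimal sketch. It is not the authors' implementation: the one-layer `encode` and `decode` functions, the random untrained weights, the cosine similarity, and the similarity-weighted average of the neighbors' latent vectors are all assumptions made for illustration only.

```python
import numpy as np

def encode(label_vectors, w_enc):
    # Hypothetical one-layer encoder: multi-hot label vector -> latent vector.
    return np.tanh(label_vectors @ w_enc)

def decode(latent, w_dec):
    # Hypothetical one-layer decoder with sigmoid outputs (one score per label).
    return 1.0 / (1.0 + np.exp(-(latent @ w_dec)))

def knn_label_ae_predict(query, train_docs, train_labels, w_enc, w_dec,
                         k=3, threshold=0.5):
    # Cosine similarity between the query and every training document
    # (assumption: any retrieval score, e.g. BM25, could be used instead).
    norms = np.linalg.norm(train_docs, axis=1) * np.linalg.norm(query)
    sims = (train_docs @ query) / (norms + 1e-12)

    # Indices of the k most similar training documents, best first.
    top = np.argsort(-sims)[:k]

    # Similarity-based weights, clipped to be non-negative and normalized.
    weights = np.clip(sims[top], 0.0, None)
    weights = weights / (weights.sum() + 1e-12)

    # Combine the neighbors' label sets in latent space, then decode.
    latent = weights @ encode(train_labels[top], w_enc)
    scores = decode(latent, w_dec)

    # Labels whose decoded score reaches the threshold form the prediction.
    predicted = np.flatnonzero(scores >= threshold)
    return top, scores, predicted

# Toy usage with random data and untrained weights (illustrative only).
rng = np.random.default_rng(0)
docs = rng.random((6, 5))                              # 6 documents, 5 features
labels = (rng.random((6, 8)) > 0.6).astype(float)      # 8 possible labels
w_enc = rng.normal(size=(8, 3))                        # 8 labels -> 3 latent dims
w_dec = rng.normal(size=(3, 8))                        # 3 latent dims -> 8 labels
neighbors, label_scores, label_ids = knn_label_ae_predict(
    docs[2], docs, labels, w_enc, w_dec, k=3)
print(neighbors, label_ids)
```

In a real system the encoder and decoder would be trained jointly to reconstruct label sets, and the threshold and weighting scheme would be tuned on held-out data.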

Paper Structure (17 sections, 4 figures, 11 tables)

This paper contains 17 sections, 4 figures, 11 tables.

Introduction
Related Work
Multi-Label Categorization
Autoencoders in Multi-Label Learning
Semantic Indexing in the Biomedical Domain
Materials and Methods
Similarity Based Categorization ($k$-NN)
Document Representation
Sparse Representations
Dense Representations
Label Autoencoders
Results and Discussion
Dataset Details and Evaluation Metrics
Experimental Results
Dense vs. Sparse Representations
...and 2 more sections

Figures (4)

Figure S1: Architecture of a generic autoencoder.
Figure S2: Categorization using $k$-NN with label autoencoders.
Figure S3: Summary of performance metrics with sparse vs. dense representations for values of $k$ with best $MiF$ values.
Figure S4: Summary of $MiF$, $MiP$, $MiR$ metrics for values of $k$ and distance weighting with best $MiF$ values in each label-AE configuration (small, medium, large).
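For readers unfamiliar with the metric names in the figure captions, the following sketch shows one common way to compute them, assuming $MiP$, $MiR$, and $MiF$ denote micro-averaged precision, recall, and F-measure over all document-label pairs. The function name `micro_prf` and the toy label sets are hypothetical.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F-measure.

    gold, pred: lists of label sets, one set per document, in the same order.
    """
    # Pool counts over all documents before computing the ratios.
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)

    mip = true_pos / n_pred if n_pred else 0.0
    mir = true_pos / n_gold if n_gold else 0.0
    mif = 2 * mip * mir / (mip + mir) if (mip + mir) else 0.0
    return mip, mir, mif

# Toy usage: two documents with made-up label sets.
gold_sets = [{"a", "b"}, {"c"}]
pred_sets = [{"a"}, {"c", "d"}]
print(micro_prf(gold_sets, pred_sets))
```

Because counts are pooled across documents, frequent labels dominate these scores, which is why they are often reported alongside ranking metrics.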
