An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

Jennifer D'Souza; Sameer Sadruddin; Maximilian Kähler; Andrea Salfinger; Luca Zaccagna; Francesca Incitti; Lauro Snidaro; Osma Suominen

TL;DR

A large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy that enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation.

Abstract

Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.

Paper Structure (24 sections, 1 equation, 5 figures, 12 tables)

Introduction
Related Work
Our Subject Indexing Dataset
The Subject Indexing Taxonomy
Our Library Records Dataset
Statistical Analysis of Subject Annotations
Overlap and Long-Tail Phenomenon
Distributional Divergence
Assessing Polysemy
Three Systems
Approaches
System 1 (salfinger-etal-2025-la2i2f)
System 2 (kahler-etal-2025-dnb)
System 3 (suominen-etal-2025-annif-germeval)
Quantitative Results
...and 9 more sections

Figures (5)

Figure 1: Four example GND records in our internal JSON representation.
Figure 2: nDCG@5 scores by the five record types.
Figure 3: nDCG@5 scores by the two languages.
Figure 4: nDCG@k scores for k = 5, 10, 15, 20 in the qualitative evaluation on 10 records from each of five domains. For this exercise, subject predictions were manually labeled Y and I by subject specialists at the library.
Figure 5: Train-frequency distribution (binned) for the System 3 predictions, split into true positives and three false-negative subtypes.
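
Figures 2 to 4 report nDCG@k. For readers unfamiliar with the metric, the following is a minimal sketch of the standard binary-relevance formulation, in which a predicted subject counts as relevant if it appears in the gold annotation set. The function names, the binary gain, and the example labels are assumptions made for illustration; the paper's own evaluation code may differ in gain values or tie handling.

```python
import math
from typing import Sequence, Set


def dcg_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Discounted cumulative gain over the top-k predictions (binary gain)."""
    return sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2) = 1
        for rank, label in enumerate(ranked[:k])
        if label in gold
    )


def ndcg_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """DCG normalized by the best achievable DCG for this gold set at cutoff k."""
    ideal_hits = min(len(gold), k)
    if ideal_hits == 0:
        return 0.0  # assumption: records without gold subjects score 0
    ideal_dcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg_at_k(ranked, gold, k) / ideal_dcg


if __name__ == "__main__":
    # Hypothetical subject labels, not taken from the dataset.
    predictions = ["Machine learning", "Library", "Cataloging"]
    gold_subjects = {"Library", "Cataloging"}
    print(ndcg_at_k(predictions, gold_subjects, k=5))
```

Averaging `ndcg_at_k` over all records in a group (for example, per record type or per language) would give one score per group, which is the kind of breakdown the figure captions describe.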
