SemEval-2025 Task 5: LLMs4Subjects -- LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog

Jennifer D'Souza, Sameer Sadruddin, Holger Israel, Mathias Begoin, Diana Slawig

TL;DR

This paper presents SemEval-2025 Task 5 LLMs4Subjects, a benchmark for automated subject tagging in the German/English TIBKAT catalog using the GND taxonomy. It evaluates a range of LLM-based and retrieval-driven systems on two dataset collections (all-subjects and tib-core), with bilingual processing and a top-k subject recommendation setting. Key findings show multilingual models, synthetic data augmentation, and retrieval-augmented pipelines materially boost performance, while very large LLMs do not always outperform well-engineered, smaller systems. The study also integrates qualitative expert assessments, revealing domain-specific strengths and weaknesses, and sets the stage for energy-efficient LLM research in a follow-up edition. All data, code, and resources are openly available, enabling replication and future extensions of this benchmark and its evaluation framework.

Abstract

We present SemEval-2025 Task 5: LLMs4Subjects, a shared task on automated subject tagging for scientific and technical records in English and German using the GND taxonomy. Participants developed LLM-based systems to recommend top-k subjects, evaluated through quantitative metrics (precision, recall, F1-score) and qualitative assessments by subject specialists. Results highlight the effectiveness of LLM ensembles, synthetic data generation, and multilingual processing, offering insights into applying LLMs for digital library classification.
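As an illustration of the quantitative metrics named above, the following is a minimal sketch of precision, recall, and F1-score at a cutoff k for a single record. It is not the task's official evaluation script: the function name `metrics_at_k`, the placeholder subject identifiers, and the choice to divide precision by k (rather than by the number of subjects actually returned) are assumptions made for this example.

```python
from typing import Sequence, Set, Tuple


def metrics_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> Tuple[float, float, float]:
    """Return (precision@k, recall@k, F1@k) for one record.

    ranked: predicted subject identifiers, best first (assumed to be unique).
    gold:   subject identifiers assigned by the human annotators.
    k:      cutoff; only the first k predictions are scored.
    """
    top_k = ranked[:k]
    hits = sum(1 for subject in top_k if subject in gold)

    # Assumption: precision is divided by k, so returning fewer than
    # k subjects is penalised.
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical identifiers, used only to show the arithmetic.
predicted = ["gnd:A", "gnd:B", "gnd:C", "gnd:D", "gnd:E"]
annotated = {"gnd:B", "gnd:E", "gnd:Z"}

p, r, f = metrics_at_k(predicted, annotated, k=5)
print(f"P@5={p:.3f} R@5={r:.3f} F1@5={f:.3f}")
```

Here two of the five predictions match the three annotated subjects, giving a precision of 2/5 and a recall of 2/3. A system-level score would typically average these per-record values over all test records, and the same function can be called with k = 5, 10, 15, and 20 to build the per-cutoff averages.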

Paper Structure

This paper contains 18 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: K@5 results for the record-type ablation. Record types: A (article), B (book), C (conference), R (report), and T (thesis). On the x-axis, teams are listed in alphabetical order by name.
  • Figure 2: K@5 results for the language ablation. On the x-axis, teams are listed in alphabetical order by name.
  • Figure 3: Overall qualitative evaluation results w.r.t. metric@5 and averages per metric@k where k = 5, 10, 15, and 20. On the x-axis, teams are listed in ranked order of performance based on average recall@k.
  • Figure 4: Qualitative results for each of the 14 distinct domains. Acronyms used: Architecture (arc), Chemistry (che), Electrical Engineering (elt), Material Science (fer), History (his), Computer Science (inf), Linguistics (lin), Literature Studies (lit), Mathematics (mat), Economics (oek), Physics (phy), Social Sciences (sow), Engineering (tec), and Traffic Engineering (ver). On the x-axis, teams are listed in alphabetical order by name.