Zero-Shot Topic Classification of Column Headers: Leveraging LLMs for Metadata Enrichment

Margherita Martorana, Tobias Kuhn, Lise Stork, Jacco van Ossenbruggen

TL;DR

This work tackles metadata enrichment for restricted-access datasets through zero-shot topic classification of column headers, using a fixed domain vocabulary (CESSDA) embedded directly in the prompt as a Large Context Window approach. It benchmarks three LLMs, namely ChatGPT (GPT-3.5), Google Bard, and Google Gemini, against human annotators, examining internal consistency, inter-LLM alignment, and LLM-human agreement, with and without contextual dataset descriptions. ChatGPT and Google Gemini generally outperform Google Bard in consistency and agreement, while contextual information has no significant impact on performance. The study demonstrates the feasibility of automated metadata enrichment via LLMs in the Semantic Web domain and outlines directions for scaling to larger vocabularies and retrieval-augmented approaches to improve dataset findability and FAIR compliance.

Abstract

Traditional dataset retrieval systems rely on metadata for indexing, rather than on the underlying data values. However, creating and enriching high-quality metadata often requires manual annotation, a labour-intensive process that is challenging to automate. In this study, we propose a method to support metadata enrichment using topic annotations generated by three Large Language Models (LLMs): ChatGPT-3.5, Google Bard, and Google Gemini. Our analysis focuses on classifying column headers based on domain-specific topics from the Consortium of European Social Science Data Archives (CESSDA), a Linked Data controlled vocabulary. Our approach operates in a zero-shot setting, integrating the controlled topic vocabulary directly within the input prompt. This integration serves as a Large Context Window approach, with the aim of improving the results of the topic classification task. We evaluate the performance of the LLMs in terms of internal consistency, inter-machine alignment, and agreement with human classification. Additionally, we investigate the impact of contextual information (i.e., the dataset description) on the classification outcomes. Our findings suggest that ChatGPT and Google Gemini outperform Google Bard in terms of internal consistency as well as LLM-human agreement. Interestingly, we found that contextual information had no significant impact on LLM performance. This work proposes a novel approach that leverages LLMs for topic classification of column headers using a controlled vocabulary, presenting a practical application of LLMs and Large Context Windows within the Semantic Web domain. This approach has the potential to facilitate automated metadata enrichment, thereby enhancing dataset retrieval and the Findability, Accessibility, Interoperability, and Reusability (FAIR) of research data on the Web.
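To make the zero-shot setup concrete, the following is a minimal sketch of how a controlled vocabulary might be embedded in a prompt, with the dataset description as an optional context block. The prompt wording, the function names `build_prompt` and `modal_agreement`, and the four-item `VOCABULARY` are all assumptions made for illustration; the placeholder list is not the actual CESSDA vocabulary, and the consistency score shown (share of repeated runs that return the most frequent answer) is one simple option, not necessarily the measure used in the paper.

```python
from collections import Counter
from typing import Optional, Sequence

# Hypothetical stand-in for the controlled vocabulary; NOT the real CESSDA list.
VOCABULARY = ["Health", "Education", "Economics", "Other"]


def build_prompt(
    header: str,
    vocabulary: Sequence[str],
    description: Optional[str] = None,
) -> str:
    """Build a zero-shot prompt that lists every allowed topic in the input."""
    lines = [
        "Classify the column header into exactly one topic from the list below.",
        "Answer with the topic name only.",
        "",
        "Topics:",
    ]
    lines += [f"- {topic}" for topic in vocabulary]
    if description:
        # Optional contextual information (the "with context" setting).
        lines += ["", f"Dataset description: {description}"]
    lines += ["", f"Column header: {header}"]
    return "\n".join(lines)


def modal_agreement(answers: Sequence[str]) -> float:
    """Share of repeated runs that return the most frequent answer."""
    if not answers:
        raise ValueError("at least one answer is required")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)
```

The same header can then be sent several times with the prompt from `build_prompt`, and the collected answers passed to `modal_agreement` to get a rough consistency score between 0 and 1 for that header.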

Paper Structure (18 sections, 2 equations, 2 figures, 3 tables)

This paper contains 18 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Summary of the topic classification task by the three LLMs, in the setting with no contextual information added to the prompt. We show the distribution of the topics classified based on 5 labels: 'Specific' topics, 'General' topics, the 'Other' topic, 'Unassigned' topics and 'Hallucinated' topics, i.e. outside of the controlled vocabulary.
  • Figure 2: Summary of the topic classification task by the three LLMs, in the setting with contextual information added to the prompt. We show the distribution of the topics classified based on 5 labels: 'Specific' topics, 'General' topics, the 'Other' topic, 'Unassigned' topics and 'Hallucinated' topics, i.e. outside of the controlled vocabulary.
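The five labels used in both figures can be illustrated with a small bucketing routine. This is a sketch under stated assumptions: it treats 'General' as top-level vocabulary topics and 'Specific' as their subtopics, treats an empty or missing answer as 'Unassigned', and uses invented placeholder topics in `GENERAL` and `SPECIFIC` rather than the real CESSDA entries. The paper's exact labelling rules may differ.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Hypothetical two-level vocabulary; placeholders, NOT the real CESSDA topics.
GENERAL = {"Health", "Education"}
SPECIFIC = {"Public health", "Higher education"}
OTHER = "Other"

LABELS = ["Specific", "General", "Other", "Unassigned", "Hallucinated"]


def label_answer(answer: Optional[str]) -> str:
    """Map one raw model answer to one of the five figure labels."""
    if answer is None or not answer.strip():
        return "Unassigned"  # the model returned no topic
    topic = answer.strip()
    if topic == OTHER:
        return "Other"
    if topic in SPECIFIC:
        return "Specific"
    if topic in GENERAL:
        return "General"
    return "Hallucinated"  # topic is outside the controlled vocabulary


def label_distribution(answers: Iterable[Optional[str]]) -> Dict[str, int]:
    """Count how many answers fall under each label, including zero counts."""
    counts = Counter(label_answer(a) for a in answers)
    return {label: counts.get(label, 0) for label in LABELS}
```

Running `label_distribution` once per model and per setting (with and without context) would yield the kind of per-label counts that the two figures summarise.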