Zero-shot data citation function classification using transformer-based large language models (LLMs)

Neil Byers, Ali Zaidi, Valerie Skye, Chris Beecroft, Kjiersten Fagnan

TL;DR

The paper tackles the problem of identifying how genomic data are used within publications by performing zero-shot data-citation function classification with an open-weight transformer model. It introduces a machine-assistant workflow that combines a decision-tree prompting strategy with retrieval-augmented generation and a novel evaluation framework called SARGO to produce structured labels (Data Accessed, Use Cases, Software/Tools) for publications linked to genomic datasets. The approach achieves a zero-shot $F1$ of $0.674$ on an evaluation set, though performance is sensitive to prompt design and data availability, and current results fall short of robust production or policy-informing standards. The study highlights the substantial resource costs, data limitations, and evaluation challenges involved in deploying LLM-based data-citation classification, and it outlines directions such as targeted fine-tuning, chunking, and automated prompt development to improve robustness while acknowledging that zero-shot deployments may remain preferable in data-scarce or costly contexts.

Abstract

Efforts have increased in recent years to identify associations between specific datasets and the scientific literature that incorporates them. Knowing that a given publication cites a given dataset, the next logical step is to explore how or why that data was used. Recent advances in pretrained, transformer-based large language models (LLMs) offer a potential means of scaling the description of data use cases in the published literature, avoiding both expensive manual labeling and the development of training datasets for classical machine-learning (ML) systems. In this work we apply an open-source LLM, Llama 3.1-405B, to generate structured data use case labels for publications known to incorporate specific genomic datasets. We also introduce a novel evaluation framework for determining the efficacy of our methods. Our results demonstrate that the stock model can achieve an F1 score of 0.674 on a zero-shot data citation classification task with no previously defined categories. While promising, our results are qualified by barriers related to data availability, prompt overfitting, computational infrastructure, and the expense required to conduct responsible performance evaluation.
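
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of the standard F1 computation (the harmonic mean of precision and recall) from raw counts. It is not the paper's evaluation code: the function name `f1_score` and the example counts are hypothetical, and how the paper's evaluation framework decides what counts as a match is not described in this section.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall, from raw counts."""
    predicted = true_positives + false_positives  # labels the model emitted
    actual = true_positives + false_negatives     # labels in the reference set
    if predicted == 0 or actual == 0:
        return 0.0  # precision or recall is undefined; report 0 by convention
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen only for illustration (not taken from the paper):
# precision = 6/8 = 0.75, recall = 6/10 = 0.60.
example = f1_score(true_positives=6, false_positives=2, false_negatives=4)
print(round(example, 3))  # prints 0.667
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a score near 0.67 can arise from quite different precision and recall combinations.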

Paper Structure

This paper contains 18 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Decision tree prompt structure underlying the machine assistant's workflow. The final outputs for a given publication-accession pair are represented by the dark gray boxes.