Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers
Aivin V. Solatorio, Rafael Macalaba, James Liounis
TL;DR
The paper tackles the problem of monitoring how datasets are mentioned and used across the scientific literature, a task hampered by scarce labeled data. It introduces a scalable pipeline with three parts: LLM-generated synthetic training data, a two-stage fine-tuning process for a specialized extractor (Phi-3.5-mini), and a ModernBERT classifier that filters candidate passages. On climate-change–related corpora, the approach achieves state-of-the-art extraction performance, outperforming the NuExtract-v1.5 and GLiNER-large-v2.1 baselines, and it generalizes well in low-resource settings because synthetic pre-fine-tuning is followed by fine-tuning on high-quality manual annotations. The framework improves data discoverability, transparency, and governance by enabling researchers, funders, and policymakers to monitor dataset usage at scale, and it could be extended to broader domains and to larger manually annotated datasets.
Abstract
Tracking how data is mentioned and used in research papers provides critical insights for improving data discoverability, quality, and production. However, manually identifying and classifying dataset mentions across vast academic literature is resource-intensive and not scalable. This paper presents a machine learning framework that automates dataset mention detection across research domains by leveraging large language models (LLMs), synthetic data, and a two-stage fine-tuning process. We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a manually annotated subset. At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall. Evaluated on a held-out manually annotated sample, our fine-tuned model outperforms NuExtract-v1.5 and GLiNER-large-v2.1 in dataset extraction accuracy. Our results highlight how LLM-generated synthetic data can effectively address training data scarcity, improving generalization in low-resource settings. This framework offers a pathway toward scalable monitoring of dataset usage, enhancing transparency, and supporting researchers, funders, and policymakers in identifying data gaps and strengthening data accessibility for informed decision-making.
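The inference-time design described in the abstract, a lightweight classifier that screens text before the more expensive extractor runs, can be illustrated with a minimal sketch. It assumes the classifier scores whole passages and that only passages above a threshold reach the extractor. The names `toy_classifier`, `toy_extractor`, `extract_mentions`, and the 0.5 threshold are hypothetical stand-ins for illustration; they are not the authors' ModernBERT classifier or fine-tuned Phi-3.5-mini model, and the keyword and regex rules only mimic their roles.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Mention:
    passage: str  # text the mention was found in
    dataset: str  # extracted dataset name


def toy_classifier(passage: str) -> float:
    """Hypothetical stand-in for the passage classifier: score in [0, 1]."""
    cues = ("survey", "dataset", "census", "database")
    return 1.0 if any(cue in passage.lower() for cue in cues) else 0.0


def toy_extractor(passage: str) -> List[str]:
    """Hypothetical stand-in for the extractor: capitalized phrase ending in a cue word."""
    pattern = r"((?:[A-Z][\w-]*\s)+(?:Survey|Dataset|Census|Database))"
    return [name.strip() for name in re.findall(pattern, passage)]


def extract_mentions(
    passages: List[str],
    classifier: Callable[[str], float],
    extractor: Callable[[str], List[str]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Mention]:
    """Run the cheap classifier first; call the extractor only on passages that pass."""
    mentions: List[Mention] = []
    for passage in passages:
        if classifier(passage) < threshold:
            continue  # skip the expensive extractor for unlikely passages
        mentions.extend(Mention(passage, name) for name in extractor(passage))
    return mentions


passages = [
    "We draw on the Living Standards Measurement Survey to estimate drought exposure.",
    "Climate adaptation requires coordinated policy responses.",
]
found = extract_mentions(passages, toy_classifier, toy_extractor)
print([m.dataset for m in found])
```

In this arrangement the threshold controls the trade-off the abstract points to: a lower value sends more passages to the extractor and favors recall, while a higher value saves more compute.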
