Large Language Models for Scientific Information Extraction: An Empirical Study for Virology

Mahsa Shamsabadi, Jennifer D'Souza, Sören Auer

TL;DR

The paper addresses the challenge of navigating extensive scholarly literature by adopting semantic, structured representations via the ORKG and demonstrates an LLM-based pipeline to generate structured scholarly contribution summaries. It employs single-task instruction finetuning of a moderate-sized FLAN-T5 model to extract six property values for $R_0$ estimates from virology abstracts, outperforming larger baselines in zero-shot settings. The study contributes a gold-standard corpus of 1,500 orkg-R0 annotations, demonstrates the feasibility of a compact, instruction-tuned approach for complex information extraction, and discusses the implications for scalable, machine-actionable scholarly knowledge publishing. The work suggests future directions in model scaling, distillation, and broader domain expansion to enhance practical impact in scientific knowledge management.

Abstract

In this paper, we champion the use of structured and semantic content representation of discourse-based scholarly communication, inspired by tools like Wikipedia infoboxes or structured Amazon product descriptions. These representations provide users with a concise overview, aiding scientists in navigating the dense academic landscape. Our novel automated approach leverages the robust text generation capabilities of LLMs to produce structured scholarly contribution summaries, offering both a practical solution and insights into LLMs' emergent abilities. For LLMs, the prime focus is on improving their general intelligence as conversational agents. We argue that these models can also be applied effectively in information extraction (IE), specifically in complex IE tasks within terse domains like Science. This paradigm shift replaces the traditional modular, pipelined machine learning approach with a simpler objective expressed through instructions. Our results show that finetuned FLAN-T5 with 1000x fewer parameters than the state-of-the-art GPT-davinci is competitive for the task.
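As a rough illustration of what "a simpler objective expressed through instructions" can look like in practice, the following minimal sketch builds one instruction prompt for an abstract and parses a line-based model answer into a structured record. It is not the authors' implementation: the property names, the prompt wording, and the `property: value` answer format are all assumptions made for this example, and no model is called.

```python
from typing import Dict, Optional, Sequence

# Hypothetical property names: the source only states that six property
# values are extracted per R0 estimate, not what they are called.
PROPERTIES: Sequence[str] = (
    "disease", "location", "date", "r0_value", "confidence_interval", "method",
)


def build_prompt(abstract: str, properties: Sequence[str] = PROPERTIES) -> str:
    """Phrase the whole extraction task as a single instruction (assumed wording)."""
    fields = ", ".join(properties)
    return (
        "Extract the following properties of the reported R0 estimate "
        f"from the abstract: {fields}.\n"
        "Answer with one 'property: value' line per property.\n\n"
        f"Abstract: {abstract}"
    )


def parse_response(
    text: str, properties: Sequence[str] = PROPERTIES
) -> Dict[str, Optional[str]]:
    """Turn 'property: value' lines into a record; missing properties stay None."""
    record: Dict[str, Optional[str]] = {name: None for name in properties}
    for line in text.splitlines():
        # Split at the first colon only, so values may themselves contain colons.
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in record and value.strip():
            record[key] = value.strip()
    return record
```

In this sketch the prompt returned by `build_prompt` would be passed to an instruction-tuned model, and `parse_response` would be applied to the generated text; unknown keys are ignored and unanswered properties remain `None`, so downstream code always sees the same six fields.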

Paper Structure (40 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Two structured research contributions compared in the Open Research Knowledge Graph (papers in columns, properties in rows and values in cells).
  • Figure 2: Comparing (A) instruction tuning with (B) instruction-tuned LLM domain- and task-tuning of this work.
  • Figure 3: Multiple instruction prompts describing our complex scientific information extraction (IE) task.
  • Figure 4: Performances range on inference instructions.
  • Figure 5: Our best model error types for text format.
  • ...and 1 more figure