VILLA: Versatile Information Retrieval From Scientific Literature Using Large LAnguage Models

Blessy Antony, Amartya Dutta, Sneha Aggarwal, Vasu Gatne, Ozan Gökdemir, Samantha Grimes, Adam Lauring, Brian R. Wasik, Anuj Karpatne, T. M. Murali

Abstract

The lack of high-quality ground truth datasets to train machine learning (ML) models impedes the potential of artificial intelligence (AI) for scientific research. Scientific information extraction (SIE) from the literature using large language models (LLMs) is emerging as a powerful approach to automate the creation of these datasets. However, existing LLM-based approaches and benchmarking studies for SIE focus on broad topics such as biomedicine and chemistry, are limited to choice-based tasks, and extract information only from short, well-formatted text. The potential of SIE methods in complex, open-ended tasks is considerably under-explored. In this study, we use virology, a domain that has been virtually ignored in SIE, to address these research gaps. We design a unique, open-ended SIE task of extracting mutations in a given virus that modify its interaction with the host. We develop a new, multi-step retrieval-augmented generation (RAG) framework called VILLA for SIE. In parallel, we curate a novel dataset of 629 mutations in ten influenza A virus proteins obtained from 239 scientific publications to serve as ground truth for the mutation extraction task. Finally, we demonstrate VILLA's superior performance using a novel and comprehensive evaluation and comparison with vanilla RAG and other state-of-the-art RAG- and agent-based tools for SIE.

Paper Structure (44 sections, 21 figures, 2 tables)


Figures (21)

  • Figure 7: The overview page of the evaluation interface allows evaluators to filter, sort, and select outputs for review while tracking evaluation progress and mitigating selection bias. Features include searchable virus/protein filters, status indicators (pending/completed), flexible sorting options, and anonymized output identifiers that prevent cherry-picking and ensure comprehensive evaluation coverage.
  • Figure 8: The evaluation interface for each LLM output consists of two primary components: a fixed left panel presenting extracted mutations and associated reasoning, and a right panel containing a structured Likert-scale rubric with dynamic descriptive text and an optional comment box for additional feedback.
  • Figure 9: The admin dashboard enables research oversight with tabular views of LLM outputs and evaluations, filtering by criteria ratings and metadata, CSV export functionality for data analysis, and monitoring for tracking evaluation progress and ensuring data quality throughout the study.
  • Figure 10: Prompt used to query large language models (LLMs) for viral mutation extraction using zero-shot prompting. The text highlighted in green denotes key points: the description of the SIE task (respond with mutations that impact virus-host interaction in a given viral protein), the required output format, and the representation format of mutations.
  • Figure 11: Distribution of the number of mutations identified by each of the eleven LLMs in ten proteins of the influenza A virus using zero-shot prompting. The x-axis denotes the ten influenza A proteins for which each LLM identified mutations. The gray bars denote the number of mutations in the ground truth known to impact virus-host interactions. The height of each bar and the black error bar correspond to the mean and standard deviation of the distribution, respectively.
  • ...and 16 more figures