Highlighting Case Studies in LLM Literature Review of Interdisciplinary System Science

Lachlan McGinness, Peter Baumgartner

TL;DR

This study evaluates GPT-3.5 Turbo and GPT-4 Turbo as assistive tools for systematic literature reviews (SLRs) across four CSIRO interdisciplinary system science case studies and introduces a semantic text highlighting method to aid expert review. It finds that the cosine similarity of transformer embeddings correlates more strongly with expert judgments than spaCy-based similarity. Quotes are reproduced with greater than 95% fidelity and overall answer accuracy is approximately 83%, though performance declines with task complexity and when multiple tasks are combined in one call. A two-step process that verifies quotes via fuzzy matching mitigates hallucinations, and a keyword-driven highlighting algorithm provides visual cues to researchers; the results show a trade-off between trustworthiness (evidence provision) and efficiency (token cost). The work informs best practices for LLM-assisted SLRs, emphasizing careful prompt design, domain-specific vocabulary, and the need for expert validation, and outlines avenues for future refinement in parsing, keyword calibration, and probabilistic similarity measures.

Abstract

Large Language Models (LLMs) were used to assist four Commonwealth Scientific and Industrial Research Organisation (CSIRO) researchers in performing systematic literature reviews (SLRs). We evaluate the performance of LLMs on SLR tasks in these case studies. In each, we explore the impact of changing parameters on the accuracy of LLM responses. The LLM was tasked with extracting evidence from chosen academic papers to answer specific research questions. We evaluate the models' performance in faithfully reproducing quotes from the literature, and subject experts assessed the models' performance in answering the research questions. We developed a semantic text highlighting tool to facilitate expert review of LLM responses. We found that state-of-the-art LLMs were able to reproduce quotes from texts with greater than 95% accuracy and answer research questions with an accuracy of approximately 83%. We use two methods to determine the correctness of LLM responses: expert review and the cosine similarity of transformer embeddings of LLM and expert answers. The correlation between these methods ranged from 0.48 to 0.77, providing evidence that the latter is a valid metric for measuring semantic similarity.
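
As an illustration only, the two automated checks named in this summary (fuzzy verification of quotes against the source text, and cosine similarity between answer embeddings) could be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `quote_fidelity` and `cosine_similarity` are hypothetical, Python's standard `difflib` stands in for whatever fuzzy matcher the study used, and the vectors passed to `cosine_similarity` are assumed to come from some transformer embedding model that is not shown here.

```python
from difflib import SequenceMatcher
import math


def quote_fidelity(quote: str, source: str) -> float:
    """Best fuzzy-match ratio (0..1) between a quote and any window of the source.

    Hypothetical helper: whitespace and case are normalised first, which is an
    assumption of this sketch rather than a detail taken from the paper.
    """
    quote = " ".join(quote.split()).lower()
    source = " ".join(source.split()).lower()
    if not quote or not source:
        return 0.0
    if quote in source:
        return 1.0  # exact reproduction after normalisation
    width = len(quote)
    best = 0.0
    # Slide a quote-sized window over the source and keep the best ratio.
    for start in range(max(1, len(source) - width + 1)):
        window = source[start:start + width]
        ratio = SequenceMatcher(None, quote, window, autojunk=False).ratio()
        best = max(best, ratio)
    return best


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch; undefined mathematically
    return dot / (norm_u * norm_v)
```

A quote would be accepted when `quote_fidelity` exceeds a chosen threshold; the threshold value is a design choice that this summary does not specify. The sliding-window search is quadratic in the worst case, so a dedicated fuzzy-matching library would likely be preferable for full-length papers.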

Paper Structure

This paper contains 12 sections, 2 equations, 4 tables, 1 algorithm.