Automating Research Synthesis with Domain-Specific Large Language Model Fine-Tuning

Teo Susnjak, Peter Hwang, Napoleon H. Reyes, Andre L. C. Barczak, Timothy R. McIntosh, Surangika Ranathunga

TL;DR

This work demonstrates that domain-specific fine-tuning of open-source LLMs, guided by a PEFT-based approach and supported by retrieval-augmented generation, can accelerate the knowledge-synthesis phase of systematic literature reviews while maintaining factual fidelity. By automatically constructing a rich fine-tuning dataset from SLR papers, inserting explicit provenance tokens, and validating against a PRISMA-based gold standard, the study shows high factual accuracy at the paper level with strong potential for broader domain application. It also introduces robust evaluation metrics (FEVER and CGS) and demonstrates replication of a published SLR, underscoring the framework’s credibility for reproducibility. The authors advocate updating PRISMA guidelines to accommodate AI-assisted workflows, highlighting the need for transparency, auditability, and methodological rigor in future SLRs.

Abstract

This research pioneers the use of fine-tuned Large Language Models (LLMs) to automate Systematic Literature Reviews (SLRs), presenting a significant and novel contribution in integrating AI to enhance academic research methodologies. Our study employed the latest fine-tuning methodologies together with open-sourced LLMs, and demonstrated a practical and efficient approach to automating the final execution stages of an SLR process that involves knowledge synthesis. The results maintained high fidelity in factual accuracy in LLM responses, and were validated through the replication of an existing PRISMA-conforming SLR. Our research proposed solutions for mitigating LLM hallucination and proposed mechanisms for tracking LLM responses to their sources of information, thus demonstrating how this approach can meet the rigorous demands of scholarly research. The findings ultimately confirmed the potential of fine-tuned LLMs in streamlining various labor-intensive processes of conducting literature reviews. Given the potential of this approach and its applicability across all research domains, this foundational study also advocated for updating PRISMA reporting guidelines to incorporate AI-driven processes, ensuring methodological transparency and reliability in future SLRs. This study broadens the appeal of AI-enhanced tools across various academic and research fields, setting a new standard for conducting comprehensive and accurate literature reviews with more efficiency in the face of ever-increasing volumes of academic studies.

Paper Structure (43 sections, 5 figures, 12 tables)

Figures (5)

  • Figure 1: The general process of conducting an SLR as outlined by Okoli (2015), marking which steps should be explicitly reported and explained to the reader in order to justify the comprehensiveness of the SLR despite its exclusion criteria, together with a detailed account of the steps taken to make the SLR reproducible.
  • Figure 2: Overview of the proposed SLR-automation Framework for evidence and knowledge synthesis.
  • Figure 3: Example of a correct response. The reference answer from the SLR study is "Predictive modelling has not featured in a larger percentage of reviewed LAD studies."
  • Figure 4: Example of a correct response. The reference answer from the SLR study is "We find that predictive modeling functionalities are not used in majority of cases within the reviewed LADs, and examples of interpretability of the models and the ability to explain their predictions to the learners do not yet exist in published studies."
  • Figure 5: Example of an incorrect response. The reference answer from the SLR study is "Bodily et al., 2018 (180), Chen et al., 2019, Aljohani et al., 2019 (86), Ulfa et al., 2019 (67), Majumdar et al., 2019, He et al., 2019 (327), Naranjo et al., 2019 (64), Baneres et al., 2019 (247), Gras et al., 2020 (127), Karaoglan Yilmaz & Yilmaz, 2020 (81), Fleur et al., 2020 (79), Chatti et al., 2020 (414), Kia et al., 2020 (449), Owatari et al., 2020 (108), Han et al., 2021 (88), Kokoç & Altun, 2021 (126), Valle et al., 2021 (179)"