Empowering Meta-Analysis: Leveraging Large Language Models for Scientific Synthesis

Jawad Ibn Ahad, Rafeed Mohammad Sultan, Abraham Kaikobad, Fuad Rahman, Mohammad Ruhul Amin, Nabeel Mohammed, Shafin Rahman

TL;DR

Fine-tuned LLMs outperform non-fine-tuned models, generating 87.6% relevant meta-analysis abstracts, and human evaluation shows irrelevant context falling from 4.56% to 1.9%.

Abstract

This study investigates the automation of meta-analysis in scientific documents using large language models (LLMs). Meta-analysis is a robust statistical method that synthesizes the findings of multiple studies (support articles) to provide a comprehensive understanding. A meta-article provides a structured analysis of several articles. However, conducting meta-analysis by hand is labor-intensive, time-consuming, and susceptible to human error, highlighting the need for automated pipelines to streamline the process. Our research introduces a novel approach that fine-tunes LLMs on extensive scientific datasets to address challenges in big data handling and structured data extraction. We automate and optimize the meta-analysis process by integrating Retrieval Augmented Generation (RAG). Tailored through prompt engineering and a new loss metric, Inverse Cosine Distance (ICD), designed for fine-tuning on large contextual datasets, LLMs efficiently generate structured meta-analysis content. Human evaluation then assesses relevance and provides information on model performance in key metrics. This research demonstrates that fine-tuned models outperform non-fine-tuned models, with fine-tuned LLMs generating 87.6% relevant meta-analysis abstracts. Human evaluation of context relevance shows a reduction in irrelevancy from 4.56% to 1.9%. These experiments were conducted in a low-resource environment, highlighting the study's contribution to enhancing the efficiency and reliability of meta-analysis automation.
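
As a quick check on the reported human-evaluation figures, the drop in irrelevancy from 4.56% to 1.9% can be read both in percentage points and as a relative reduction. This is only arithmetic on the two numbers quoted in the abstract, not an additional result from the paper:

$$
4.56\% - 1.9\% = 2.66 \text{ percentage points}, \qquad \frac{4.56 - 1.9}{4.56} \approx 0.583 \quad (\text{roughly a } 58\% \text{ relative reduction}).
$$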

Paper Structure

This paper contains 12 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: (a) A paraphraser-based approach that combines multiple generated summary chunks from LLMs has been used by Subbiah2024ReadingSElim2023improving. (b) A retrieval-augmented generation-based approach has been applied in yepes2024financial and manathunga2023retrieval, using a vector database to store chunked data and cluster them before passing them to an LLM to produce a summary. Existing methods often fall short in handling big scientific contextual data and generating structured synthesis. (c) We propose a novel approach involving fine-tuning LLMs with large contexts and utilizing them to generate meta-analysis abstracts. Abstracts from support papers serve as input, with meta-papers' abstracts as labels. Pre-processing involves chunking the dataset due to context length restrictions and prioritizing small LLMs over resource-intensive large LLMs. The fine-tuned model generates meta-analysis abstracts via semantic search with the provided context and query.
  • Figure 2: In our meta-analysis generation system, support articles $S^j$ undergo chunk-based pre-processing, producing chunks $C^j_i \subseteq S^j$; here "SP:" refers to an abstract of the support article $S^j$. These chunks are used to fine-tune the LLMs for predicting meta-analysis abstracts $y^j$, with the ICD loss guiding the fine-tuning process. Model performance is assessed through human evaluation of the relevancy of the meta-analysis abstracts $\hat{y}^j$ generated by the fine-tuned LLMs. During inference, we integrate RAG with the fine-tuned LLMs. Chunked samples are stored in a vector database, from which relevant information is retrieved via a semantic search based on a query. The same processed $C^j_i$ is used for both fine-tuning and inference to maintain retrieval consistency. The retrieved content and the query are provided to the LLM, enabling it to generate more precise and accurate meta-analysis abstracts by leveraging comprehensive contextual information.
  • Figure 3: Distribution of Supporting Articles in Meta-Articles in the dataset MAD. The chart shows that most meta-articles contain 6 to 14 support articles, with peaks at 6 and 9, suggesting a common reliance on a moderate number of supporting studies, with fewer analyses incorporating larger study pools.
  • Figure 4: Investigating the impact of (a) Temperature variation: BLEU, ROUGE-1, ROUGE-2, and ROUGE-L scores vary with temperature for both the Llama-2 (7B) and Mistral-v0.1 (7B) models, indicating that a temperature of 0.7 gives better results. (b) Loss function impact: ICD loss significantly improves performance for the Llama-2 (7B) FT and Mistral-v0.1 (7B) FT models, demonstrating its ability to capture more information than the default loss.
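
The inference path described for Figure 2 (chunk the support abstracts, store the chunks, retrieve the most relevant ones by semantic search, and pass them with the query to the model) can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the paper's implementation: the function names (`chunk_words`, `embed`, `retrieve`, `build_prompt`), the chunk sizes, and the prompt layout are hypothetical, a plain Python list stands in for the vector database, and a toy bag-of-words vector stands in for a real embedding model.

```python
import math
import re
from collections import Counter


def chunk_words(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, retrieved):
    """Assemble retrieved chunks and the query into one prompt string.

    The "SP:" prefix mirrors the support-abstract marker mentioned in the
    Figure 2 caption; the rest of the layout is hypothetical.
    """
    context = "\n".join(f"SP: {c}" for c in retrieved)
    return f"{context}\n\nQuery: {query}\nMeta-analysis abstract:"


if __name__ == "__main__":
    store = [
        "statins reduce cardiovascular risk in adults",
        "deep learning improves image segmentation",
        "exercise lowers blood pressure",
    ]
    query = "statins cardiovascular risk"
    print(build_prompt(query, retrieve(query, store, top_k=1)))
```

In a real pipeline the resulting prompt would be sent to the fine-tuned LLM; swapping `embed` for a learned embedding model and the list for an actual vector database would leave the overall flow unchanged.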