A Reliable Knowledge Processing Framework for Combustion Science using Foundation Models

Vansh Sharma, Venkat Raman

TL;DR

This research explores the integration of large language models (LLMs) into scientific data assimilation, focusing on combustion science as a case study, and introduces a custom workflow developed with a detection algorithm to filter out inaccuracies.

Abstract

This research explores the integration of large language models (LLMs) into scientific data assimilation, focusing on combustion science as a case study. Leveraging foundation models integrated with a Retrieval-Augmented Generation (RAG) framework, the study introduces an approach to process diverse combustion research data spanning experimental studies, simulations, and literature. The multifaceted nature of combustion research makes knowledge processing critical for navigating and extracting valuable information from a vast and diverse pool of sources. The developed approach minimizes computational and economic expenses while optimizing data privacy and accuracy. It incorporates prompt engineering and offline open-source LLMs, offering user autonomy in selecting base models. The study examines text segmentation strategies, compares LLMs, and explores various optimized prompts to demonstrate the effectiveness of the framework. By incorporating an external database, the framework outperforms a conventional LLM in generating accurate responses and constructing robust arguments. The study also investigates optimized prompt templates for efficient extraction of scientific literature. It addresses concerns about hallucinations and false research articles by introducing a custom workflow with a detection algorithm that filters out inaccuracies. Despite identified areas for improvement, the framework consistently delivers accurate domain-specific responses with minimal human oversight. The prompt-agnostic approach introduced holds promise for future deliberations. The study underscores the significance of integrating LLMs and knowledge processing techniques in scientific research, providing a foundation for advancements in data assimilation and utilization.
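
The retrieval-augmented pattern summarized in the abstract (segment the source text, retrieve the segments most relevant to a query from an external store, and place them in the prompt) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `segment`, `embed`, `retrieve`, and `build_prompt` are hypothetical, the bag-of-words "embedding" is a stand-in for a learned embedding model, the in-memory list stands in for the external database, and the prompt wording is an assumption. The LLM call itself is omitted.

```python
import math
import re
from collections import Counter


def segment(text, size=40, overlap=10):
    """Split text into overlapping word windows (one simple segmentation strategy)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    # Stop once the remaining words are already covered by the previous window.
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Toy bag-of-words vector; an assumption standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, context_chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )


if __name__ == "__main__":
    # Illustrative stand-in corpus, not data from the paper.
    corpus = [
        "laminar flame speed depends on equivalence ratio",
        "soot forms in fuel rich regions",
        "detonation waves couple a shock with a reaction zone",
    ]
    question = "what controls laminar flame speed"
    print(build_prompt(question, retrieve(question, corpus, k=1)))
```

Restricting the prompt to retrieved context is one common way to reduce unsupported answers. The detection algorithm the abstract mentions for filtering inaccuracies is a separate step and is not sketched here.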

Paper Structure (17 sections, 1 equation, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Radar diagram comparing different LLM optimization strategies for an information extraction task. The current work focuses on RAG integrated with prompt engineering as the strategy for adapting language models to specific science domains.
  • Figure 2: Process workflow for information retrieval and querying.
  • Figure 3: Embedding documents using a multi-processing framework to persist in a database.
  • Figure 4: Workflow for generative answering process.
  • Figure 5: Optimal prompt stencil structures for knowledge extraction.
  • ...and 6 more figures