Knowledge Synthesis of Photosynthesis Research Using a Large Language Model

Seungri Yoon, Woosang Jeon, Sanghyeok Choi, Taehyeong Kim, Tae In Ahn

TL;DR

The paper addresses the cognitive burden and information overload in photosynthesis research by introducing PRAG, a GPT-4o-based photosynthesis research assistant that uses retrieval-augmented generation (RAG) and prompt optimization. PRAG combines a vector database and an automated feedback loop for prompt optimization with a knowledge graph that structures its responses, enabling hypothesis generation and knowledge synthesis with paper-level transparency. It reports an average improvement of 8.7% across five scientific-writing metrics and a 25.4% increase in source transparency over the baseline GPT-4o model, with scientific depth and domain coverage comparable to those of published photosynthesis papers. The approach is evaluated against both database and test papers and visualized with knowledge graphs, and the code is publicly available for broader plant science applications and future multimodal enhancements.

Abstract

The development of biological data analysis tools and large language models (LLMs) has opened up new possibilities for utilizing AI in plant science research, with the potential to contribute significantly to knowledge integration and research gap identification. Nonetheless, current LLMs struggle to handle complex biological data and theoretical models in photosynthesis research and often fail to provide accurate scientific contexts. Therefore, this study proposed a photosynthesis research assistant (PRAG) based on OpenAI's GPT-4o with retrieval-augmented generation (RAG) techniques and prompt optimization. Vector databases and an automated feedback loop were used in the prompt optimization process to enhance the accuracy and relevance of the responses to photosynthesis-related queries. PRAG showed an average improvement of 8.7% across five metrics related to scientific writing, with a 25.4% increase in source transparency. Additionally, its scientific depth and domain coverage were comparable to those of photosynthesis research papers. A knowledge graph was used to structure PRAG's responses with papers within and outside the database, which allowed PRAG to match key entities with 63% and 39.5% of the database and test papers, respectively. PRAG can be applied for photosynthesis research and broader plant science domains, paving the way for more in-depth data analysis and predictive capabilities.
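
The automated feedback loop mentioned in the abstract (and detailed in the Figure 2c caption: a RAG Assistant, a RAG Evaluator scoring five metrics, a Prompt Refiner, low-score filtering, and 10 iterations) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `Evaluation` container, the `min_score` threshold, and the toy stand-in components are all assumptions made here, and no real LLM or vector database is called.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Optional, Tuple

# The five metrics named in the Figure 2 caption.
METRICS = (
    "scientific_accuracy",
    "alignment_with_research_objectives",
    "source_transparency",
    "scholarly_tone",
    "information_reliability",
)


@dataclass
class Evaluation:
    """Hypothetical container for the evaluator's scores and written feedback."""
    scores: Dict[str, float]
    feedback: str

    @property
    def average(self) -> float:
        return mean(self.scores[m] for m in METRICS)


def optimize_prompt(
    question: str,
    prompt: str,
    assistant: Callable[[str, str], str],
    evaluator: Callable[[str, str], Evaluation],
    refiner: Callable[[str, Evaluation], str],
    iterations: int = 10,      # the caption reports 10 iterations
    min_score: float = 0.0,    # assumed cutoff for "low-score" responses
) -> Tuple[str, Optional[str], float]:
    """Return the best (prompt, response, average score) seen in the loop.

    The response is None if every response fell below min_score.
    """
    best_prompt, best_response, best_score = prompt, None, float("-inf")
    for _ in range(iterations):
        response = assistant(prompt, question)       # generate an answer
        evaluation = evaluator(question, response)   # score on five metrics
        score = evaluation.average
        # Keep only responses that pass the filter and improve on the best.
        if score >= min_score and score > best_score:
            best_prompt, best_response, best_score = prompt, response, score
        prompt = refiner(prompt, evaluation)         # adjust prompt from feedback
    return best_prompt, best_response, best_score


# Toy stand-ins so the sketch runs without any external service.
def toy_assistant(prompt: str, question: str) -> str:
    return f"[{prompt}] {question}"


def toy_evaluator(question: str, response: str) -> Evaluation:
    # Pretend that each "cite" instruction raises every metric by one point.
    level = min(5.0, 1.0 + response.count("cite"))
    return Evaluation({m: level for m in METRICS}, "ask for citations")


def toy_refiner(prompt: str, evaluation: Evaluation) -> str:
    return prompt + " cite"
```

In a real pipeline the three callables would wrap the retrieval-backed assistant, an LLM-based evaluator, and an LLM-based prompt rewriter; passing them in as arguments keeps the loop itself independent of any particular model or database.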

Paper Structure

This paper contains 15 sections, 6 figures.

Figures (6)

  • Figure 1: Overview of our contributions. a, We selected 150 photosynthesis research papers with high citation indices to generate and curate research questions and answers (QAs). Then, the papers were incorporated into a vector database (DB), and the GPT-4o model was improved using RAG and prompt optimization processes. PRAG and the baseline GPT-4o model were comparatively evaluated based on five metrics required for scientific paper writing. b, Core components of scientific papers, such as hypotheses and discussions, were mapped to the LLM's question-and-answer format. We prepared database papers and test papers, generated PRAG responses to the hypotheses of each paper in raw text, and structured and parsed these responses for visualization in a knowledge graph. Finally, we evaluated PRAG's potential to provide scientific insights through structural and semantic comparative analyses.
  • Figure 2: Evaluation of the scientific writing performance of the photosynthesis research assistant PRAG and prompt optimization process. a, Responses of the improved model were evaluated based on five metrics required for scientific paper writing (scientific accuracy, alignment with research objectives, source transparency, scholarly tone, and information reliability). The evaluation was conducted using 150 training sets and 2,000 test sets out of 10,000 QA sets, thus reflecting the linguistic characteristics of QA sets extracted from photosynthesis research papers. The t-test results showed significant differences between the two models across all metrics ($p < 0.001$). This indicates that the PRAG model performed significantly better overall compared to the GPT-4o model (*** indicates $p < 0.001$). b, Score change trends across the five evaluation metrics were analyzed as the RAG and prompt optimization processes were applied. The plot shows the score changes from the base model (Base) to the model with RAG only (1st iteration) and then with both RAG and prompt optimization (2nd iteration, Sky blue dashed line). The light background indicates the standard deviation (SD) at each iteration. c, Prompt optimization process: When a research question is input, the RAG Assistant generates an initial response based on the photosynthesis research database. The generated response and citation references are evaluated by the RAG Evaluator using five evaluation metrics. Feedback (scores and evaluations) is then delivered to the Prompt Refiner, where prompt adjustments are made to improve the response. Responses with low scores are filtered out, and a total of 10 iterations are performed to achieve optimization.
  • Figure 3: Analysis of the scientific depth and domain coverage for photosynthesis research papers and PRAG discussion. a, Histogram showing the scientific depth (top-left) and domain coverage scores (top-right) for 150 photosynthesis research papers using the GPT-4o-mini model (gray color) and 30 PRAG discussions (teal color). b, Scatter plot depicting the relationship between scientific depth and domain coverage scores for photosynthesis-related texts. c, Highest-scoring PRAG discussion (teal dashed box), research paper text sample containing the corresponding hypothesis (gray dashed box), and evaluation details. The full set of PRAG discussions (PDFs) and code for the scientific text evaluation model are available at https://github.com/PRAG-SNU.
  • Figure 4: Comparison of concepts and relationships in research papers and PRAG discussions using the knowledge graph. a, Comparison of the number of tokens and entities in database papers, test papers, and PRAG discussions. b, Similarity evaluation of the entity match rate, relationship match rate, structural similarity (average of entity and relationship matches), and semantic similarity (including similar entities and relationships, then averaged). c, Visualization of entity distribution across spatial and temporal scales, with entities extracted from scientific texts mapped across spatial scales (from the molecular to macro-environment level) and temporal scales (from immediate crop responses to long-term responses spanning centuries). The scientific text parser, knowledge graph construction code, and entity visualization code are available at https://github.com/PRAG-SNU.
  • Figure 5: Knowledge graph comparison between database (DB) research papers and PRAG discussions. We used the knowledge graph to compare DB papers and PRAG discussions on the following question: “How important are far-red photons in plant photosynthesis, and how do these photons affect crop photosynthetic efficiency and productivity under sunlight conditions?” a, Core entity diagram and key properties of shared entities in both DB papers and PRAG discussions. b, Knowledge graph comparison between DB papers and PRAG: Both graphs share key concepts related to the hypothesis, showing structural alignment from the molecular and cellular levels to the macro-environmental scale. c, PRAG entities and relationships at the agricultural and crop scale. d, PRAG entities and relationships at the molecular and cellular scale: Dotted lines indicate concepts that were added by PRAG but omitted in the DB query. e, Mapping of knowledge graphs related to photosynthesis mechanisms across spatiotemporal scales. The knowledge graph illustrates the interactions related to photosynthesis at various spatiotemporal levels.
  • ...and 1 more figure
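
The Figure 4b caption defines structural similarity as the average of the entity match rate and the relationship match rate. A minimal sketch of that arithmetic is given below. The details are assumptions made for illustration: a match is taken to be an exact match after lowercasing and whitespace cleanup, the rate is taken relative to the paper's own entities or relationships, and the example entities and triples are invented toy values, not data from the paper. Semantic similarity, which also counts similar (not identical) items, is left out because it would need a similarity model.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def entity_set(names: Iterable[str]) -> Set[str]:
    return {normalize(n) for n in names}


def relation_set(triples: Iterable[Triple]) -> Set[Triple]:
    return {(normalize(s), normalize(p), normalize(o)) for s, p, o in triples}


def match_rate(reference: set, candidate: set) -> float:
    """Fraction of reference items that also appear in the candidate set."""
    if not reference:
        return 0.0
    return len(reference & candidate) / len(reference)


def structural_similarity(
    ref_entities: set, cand_entities: set,
    ref_relations: set, cand_relations: set,
) -> float:
    """Average of the entity and relationship match rates (Figure 4b)."""
    entity_rate = match_rate(ref_entities, cand_entities)
    relation_rate = match_rate(ref_relations, cand_relations)
    return (entity_rate + relation_rate) / 2


# Invented toy graphs: "paper" is the reference, "prag" the generated discussion.
paper_entities = entity_set(
    ["Far-red photons", "Photosystem I", "Photosystem II", "Quantum yield"]
)
prag_entities = entity_set(
    ["far-red photons", "photosystem I", "canopy photosynthesis"]
)
paper_relations = relation_set([
    ("Far-red photons", "excite", "Photosystem I"),
    ("Photosystem II", "limits", "Quantum yield"),
])
prag_relations = relation_set([
    ("far-red photons", "excite", "photosystem I"),
])

entity_rate = match_rate(paper_entities, prag_entities)        # 2 of 4 shared
relation_rate = match_rate(paper_relations, prag_relations)    # 1 of 2 shared
similarity = structural_similarity(
    paper_entities, prag_entities, paper_relations, prag_relations
)
```

Using sets of normalized strings and triples keeps the comparison order-independent; a different choice of denominator (for example the union of both graphs) would give a Jaccard-style score instead.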