Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System

Sonakshi Gupta, Akhlak Mahmood, Wei Xiong, Rampi Ramprasad

TL;DR

This work tackles the difficulty of mining unstructured polymer literature by building literature-grounded reasoning systems for polyhydroxyalkanoates (PHAs). It develops two retrieval-augmented generation pipelines, VectorRAG (dense semantic) and GraphRAG (knowledge-graph–based), built on a curated PHA corpus of 1,028 papers (44,609 paragraphs) and a broader ~3 million-document literature corpus. The authors use context-preserving paragraph chunks and domain-aware entity normalization to enable multi-hop reasoning with interpretable evidence trails. They validate the systems with 113 domain-expert questions plus expert evaluations, which show that GraphRAG offers higher precision and interpretability while VectorRAG provides broader recall. The study demonstrates a practical, transparent path to trustworthy literature analysis at scale: it offers an interactive interface, rivals or surpasses web-backed systems in domain grounding, reduces reliance on opaque proprietary models, and can extend to other materials domains.

Abstract

Polymer literature contains a large and growing body of experimental knowledge, yet much of it is buried in unstructured text and inconsistent terminology, making systematic retrieval and reasoning difficult. Existing tools typically extract narrow, study-specific facts in isolation, failing to preserve the cross-study context required to answer broader scientific questions. Retrieval-augmented generation (RAG) offers a promising way to overcome this limitation by combining large language models (LLMs) with external retrieval, but its effectiveness depends strongly on how domain knowledge is represented. In this work, we develop two retrieval pipelines: a dense semantic vector-based approach (VectorRAG) and a graph-based approach (GraphRAG). Using over 1,000 polyhydroxyalkanoate (PHA) papers, we construct context-preserving paragraph embeddings and a canonicalized structured knowledge graph supporting entity disambiguation and multi-hop reasoning. We evaluate these pipelines through standard retrieval metrics, comparisons with general state-of-the-art systems such as GPT and Gemini, and qualitative validation by a domain chemist. The results show that GraphRAG achieves higher precision and interpretability, while VectorRAG provides broader recall, highlighting complementary trade-offs. Expert validation further confirms that the tailored pipelines, particularly GraphRAG, produce well-grounded, citation-reliable responses with strong domain relevance. By grounding every statement in evidence, these systems enable researchers to navigate the literature, compare findings across studies, and uncover patterns that are difficult to extract manually. More broadly, this work establishes a practical framework for building materials science assistants using curated corpora and retrieval design, reducing reliance on proprietary models while enabling trustworthy literature analysis at scale.
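
The precision and recall trade-off described above can be made concrete with a small sketch. The abstract does not define the exact retrieval metrics used, so the function below assumes the common precision@k and recall@k definitions; the function name and the paragraph IDs are hypothetical and serve only as illustration.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for a single query.

    retrieved: ranked list of paragraph IDs returned by a pipeline.
    relevant:  collection of paragraph IDs judged relevant for the query.
    """
    top_k = retrieved[:k]
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)
    # Precision: fraction of the returned items that are relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: fraction of the relevant items that were returned.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical IDs, not taken from the paper's corpus.
retrieved = ["p3", "p7", "p1", "p9"]
relevant = ["p1", "p3", "p5"]
precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
```

Under these definitions, a pipeline that returns fewer but better-targeted passages tends to score higher on precision, while one that casts a wider net tends to score higher on recall, which is the pattern the abstract reports for GraphRAG and VectorRAG respectively.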

Paper Structure (4 sections, 5 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: (a) Overview of the retrieval-augmented generation (RAG) framework for literature-grounded question answering on PHAs. The curated PHA corpus, comprising 1,028 articles and 44,609 parsed paragraphs, serves as the shared knowledge base for both VectorRAG and GraphRAG retrieval pipelines. Retrieved contextual evidence from each pipeline is provided to an LLM to generate literature-grounded responses. (b) Example scientific queries and corresponding responses generated by the Polymer Literature Scholar. Responses are synthesized by aggregating evidence retrieved from multiple relevant studies within the PHA corpus, as indicated by the document symbol.
  • Figure 2: VectorRAG workflow for literature-grounded question answering on PHAs. (a) Backend processing of the corpus, where full-text articles are parsed, contextually grouped into condensed text chunks, and embedded into a dense semantic space for similarity-based retrieval. (b) Retrieval-augmented inference, in which user queries are encoded in the same latent space and matched to the most semantically aligned text segments from the corpus. (c) Conceptual representation of the embedding space illustrating how relevant paragraphs are identified through cosine similarity between the query vector and its nearest neighbors. (d) Example of a representative query showing retrieved literature passages and the corresponding grounded, citation-linked response generated by the language model.
  • Figure 3: GraphRAG workflow for knowledge graph–based question answering on PHAs. (a) Backend processing of the corpus, where entities and relationships are extracted from literature, normalized, and stored as relational tuples in a relational database. (b) Retrieval and reasoning stage showing how user queries are decomposed into entity–relation pairs, matched to canonical entities, and re-ranked through a path-based scoring strategy to identify the most relevant subgraphs for grounded response generation. (c) Example of entity canonicalization, where multiple related mentions (e.g., PHB–Ag, maleated PHB, 3-armed PHB) are merged into a unified canonical node representing PHB. The accompanying bar plot highlights the top canonical entities ranked by cluster size, demonstrating how normalization improves graph connectivity and recall. (d) Representative example showing a user query, retrieved knowledge graph tuples, and the corresponding grounded response synthesized by the language model with supporting citations.
  • Figure 5: Domain-expert evaluation of five RAG pipelines across General, Paper-specific, and Multi-paper questions. Scores reflect how well each system balanced factual grounding, contextual coverage, and citation reliability. GraphRAG (GPT-4o-mini) and ChatGPT-5 with Web Search achieved the highest overall performance, with other pipelines showing moderate but consistent results.
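
The similarity-based retrieval step described in the Figure 2 caption (a query vector matched to its nearest neighbors by cosine similarity) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the three-dimensional vectors, the chunk IDs, and the function names are all hypothetical, and a real system would obtain the vectors from an embedding model and search them with a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumed values, for illustration only).
chunks = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.8, 0.6, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]
results = top_k_chunks(query, chunks, k=2)
```

In this toy setup the query points almost along the first axis, so `chunk_a` ranks first, `chunk_b` second, and the orthogonal `chunk_c` is excluded. The retrieved chunks would then be passed to the language model as context for a citation-linked answer.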