Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning

Markus J. Buehler

TL;DR

This work presents a framework to accelerate scientific discovery by converting a large corpus of papers into an ontological knowledge graph using generative AI. It combines text distillation, triple extraction, and global graph assembly to create a scale-free network with a large giant component, enabling complex graph-based reasoning and cross-domain connections, including isomorphisms with artistic domains. The author introduces multimodal reasoning by integrating images and painting-inspired prompts, and demonstrates data augmentation through adversarial multi-agent modeling to continuously expand the knowledge graph. The approach yields novel design insights, such as isomorphic mappings between biology and art, and shows the potential to guide material design (for example, nacre-inspired composites) through graph-guided reasoning, with implications for identifying knowledge gaps and proposing interdisciplinary innovations.
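The assembly step summarized above can be illustrated with a minimal sketch. It assumes the triples have already been extracted; the triples, the function names `assemble_global_graph` and `giant_component`, and the undirected treatment of edges are all hypothetical choices made for illustration, not the paper's implementation.

```python
from collections import defaultdict, deque

# Hypothetical triples, as if extracted from two separate text chunks.
chunk_a = [("nacre", "is composed of", "aragonite platelets"),
           ("aragonite platelets", "provide", "stiffness")]
chunk_b = [("nacre", "exhibits", "high fracture toughness"),
           ("mycelium", "forms", "porous networks")]

def assemble_global_graph(local_graphs):
    """Merge per-chunk triple lists into one undirected adjacency map."""
    adjacency = defaultdict(set)
    edge_labels = {}
    for triples in local_graphs:
        for source, relation, target in triples:
            adjacency[source].add(target)
            adjacency[target].add(source)
            edge_labels[(source, target)] = relation
    return adjacency, edge_labels

def giant_component(adjacency):
    """Return the largest connected component, found by breadth-first search."""
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

graph, labels = assemble_global_graph([chunk_a, chunk_b])
giant = giant_component(graph)
print(sorted(giant))
```

Because "nacre" appears in both chunks, the two local graphs merge through that shared node, while the "mycelium" triple stays in a separate, smaller component.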

Abstract

Leveraging generative Artificial Intelligence (AI), we have transformed a dataset comprising 1,000 scientific papers into an ontological knowledge graph. Through an in-depth structural analysis, we have calculated node degrees, identified communities and connectivities, and evaluated clustering coefficients and betweenness centrality of pivotal nodes, uncovering fascinating knowledge architectures. The graph has an inherently scale-free nature, is highly connected, and can be used for graph reasoning by taking advantage of transitive and isomorphic properties that reveal unprecedented interdisciplinary relationships that can be used to answer queries, identify gaps in knowledge, propose never-before-seen material designs, and predict material behaviors. We compute deep node embeddings for combinatorial node similarity ranking for use in a path sampling strategy that links dissimilar concepts that have previously not been related. One comparison revealed structural parallels between biological materials and Beethoven's 9th Symphony, highlighting shared patterns of complexity through isomorphic mapping. In another example, the algorithm proposed a hierarchical mycelium-based composite based on integrating path sampling with principles extracted from Kandinsky's 'Composition VII' painting. The resulting material integrates an innovative set of concepts that include a balance of chaos/order, adjustable porosity, mechanical strength, and complex patterned chemical functionalization. We uncover other isomorphisms across science, technology and art, revealing a nuanced ontology of immanence that reveals a context-dependent heterarchical interplay of constituents. Graph-based generative AI achieves a far higher degree of novelty, explorative capacity, and technical detail than conventional approaches and establishes a widely useful framework for innovation by revealing hidden connections.
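The idea of ranking node pairs by embedding similarity and then connecting dissimilar concepts with a graph path can be sketched as follows. This is only an illustration under stated assumptions: the 3-dimensional embeddings, the edges, and the helper names `rank_pairs` and `shortest_path` are hypothetical, and a plain breadth-first shortest path stands in for whatever path sampling strategy the paper actually uses.

```python
import math
from collections import deque
from itertools import combinations

# Hypothetical 3-d embeddings; real node embeddings would come from a language model.
embeddings = {
    "nacre": [0.9, 0.1, 0.0],
    "toughness": [0.8, 0.3, 0.1],
    "hierarchy": [0.4, 0.6, 0.3],
    "symphony": [0.0, 0.2, 0.9],
}
# Hypothetical undirected edges of a small knowledge graph.
edges = [("nacre", "toughness"), ("toughness", "hierarchy"), ("hierarchy", "symphony")]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_pairs(vectors):
    """Return all node pairs as (similarity, a, b), least similar first."""
    pairs = combinations(sorted(vectors), 2)
    return sorted((cosine(vectors[a], vectors[b]), a, b) for a, b in pairs)

def shortest_path(edge_list, start, goal):
    """Breadth-first search for one shortest path, or None if unreachable."""
    adjacency = {}
    for a, b in edge_list:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    parents, queue = {start: None}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

ranked = rank_pairs(embeddings)
_, concept_a, concept_b = ranked[0]          # least similar pair
path = shortest_path(edges, concept_a, concept_b)
print(concept_a, concept_b, path)
```

The intermediate nodes on the returned path are the candidate "bridging" concepts that a downstream language model could be asked to reason over.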

Paper Structure (26 sections, 6 equations, 14 figures, 13 tables)

Introduction
Results and discussion
Construction and analysis of the global graph
Extraction of multiple graph traversal paths via ranked combinatorial analysis of cosine similarities
Reasoning over the graph: Graph traversal based question answering
Isomorphism analysis across distinct graph structures
Multimodal knowledge generation and incorporation into augmented graphs
Generating new data through conversations with a complex generative model and incorporation into augmented graphs
Agentic modeling for adversarial knowledge generation and incorporation into augmented graphs
Incorporation of new data from scientific literature towards the design of sustainable mycelium composite materials
Joint analysis of artistic images with graph reasoning and image synthesis for hierarchical materials design
Conclusion
Key technical insights about materials
Limitations and opportunities
Materials and Methods
...and 11 more sections

Figures (14)

Figure 1: Overview of the approach used here. Panel a depicts the strategic objective to convert information (the answers to "who," "what," "where," and "when" questions) into knowledge (about "how"). Information is relatively easily accessible, can be recorded in books, and is easy to transmit. Knowledge, in contrast, is typically harder to communicate and can be hard to transfer from one person to another. Panel b shows how we solve this problem: a set of scientific articles is first converted into markup language and then into text chunks, which form the basis for distilling their content into a concise scientific summary. The raw content then forms the basis to generate triples for a graph, first created at the level of each of the text chunks, and then assembled into a global graph by concatenating all local graphs.
Figure 2: Overview of the global graph (panel a), multiple magnifications (panel b) and illustration of the deep and wide connectivity of nodes (panel c). Panel b depicts the entire graph, followed by successively zoomed-in views of the graph structure. At the highest magnification, individual nodes and node labels become visible. Panel c shows a similar progression, but with one of the nodes, 'nacre', highlighted (and the rest greyed out), revealing the wide-ranging connections across the global graph. Such highly connected nodes are essential for the knowledge graph's functionality, acting as central hubs that enhance its ability to represent, access, and discover scientific knowledge.
Figure 3: Summary of graph statistics of the global graph, complementing the analysis in Table \ref{tab:table_graphproperties_comb}. Panel a shows a log-log plot of the degree distribution, and panels b and c show a principal component analysis of the node embeddings (for 5 clusters in b and 10 clusters in c). Panels d-f show the same analysis, but for the giant component of the graph only. For the plots in panels a and d, we use log1p to transform node degrees before plotting, which provides a clearer and more interpretable visualization by handling zero values and reducing skewness. This transformation spreads the data more evenly across the histogram bins, highlighting patterns and variability that may be obscured when plotting raw degrees directly.
Figure 4: Comprehensive analysis of the structural properties of communities within a network, showing the size of all communities (a), the average node degree for each community (b), the average clustering coefficient for each community (c), and the average betweenness centrality of the nodes in each community (d). Panel b illustrates the average node degree per community, demonstrating generally consistent internal connectivity with notable outliers, indicative of more densely interconnected communities. Panel c explores the average clustering coefficient, revealing that while most communities do not show a propensity for tight clustering, a select few deviate with higher coefficients, suggesting localized pockets of closely-knit nodes. Panel d examines the average betweenness centrality for the most influential nodes in each community, displaying a rather even distribution across the network with slight variations, implying distributed rather than centralized control over the network's connectivity. These metrics provide insight into the network's topology, highlighting the balance between uniformly distributed influence and the existence of specialized clusters within the network's architecture. Panel e depicts an analysis of community structure in the graph, showing the average number of edges that connect nodes within the same community (left) and the average number of inter-community edges, that is, edges that connect nodes from different communities (right). This data underscores the finding that the network seems to exhibit strong community structure, with more connections within communities than between them. Panel f shows the degree distribution of the global network on a log-log scale, with the empirical data in blue and the best-fit power-law model as a dashed red line. The power-law fit appears to follow the distribution of the data reasonably well, especially in the tail (high-degree region).
Figure 5: This plot illustrates the relationship between community size (number of nodes) and average clustering coefficient for different communities within the knowledge graph. Each point represents a community, with the color indicating the average degree of the nodes in that community, ranging from blue (lower average degree) to red (higher average degree). The $x$-axis and $y$-axis are on a logarithmic scale to capture the distribution across several orders of magnitude.
...and 9 more figures
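
Several of the quantities named in the captions of Figures 3 to 5 (log1p-transformed node degrees, intra- versus inter-community edges, and per-community averages of degree and clustering coefficient) can be computed on a toy graph as a rough sketch. The graph, the community assignment, and all variable names below are hypothetical, and the sketch reports total rather than averaged edge counts; it is not the analysis code behind the figures.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
# Hypothetical community labels; a real analysis would detect these from the graph.
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

# Node degrees and their log1p transform, log(1 + k), which is defined for k = 0.
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
log_degree = {node: math.log1p(k) for node, k in degree.items()}

def local_clustering(node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for x, y in combinations(neighbors, 2) if y in adjacency[x])
    return 2.0 * links / (k * (k - 1))

# Edges inside a community versus edges crossing community boundaries.
intra = sum(1 for u, v in edges if community[u] == community[v])
inter = len(edges) - intra

# Per-community size, average degree, and average clustering coefficient.
members = defaultdict(list)
for node, label in community.items():
    members[label].append(node)
stats = {
    label: {
        "size": len(nodes),
        "avg_degree": sum(degree[n] for n in nodes) / len(nodes),
        "avg_clustering": sum(local_clustering(n) for n in nodes) / len(nodes),
    }
    for label, nodes in members.items()
}
print(intra, inter, stats)
```

Plotting `size` against `avg_clustering` on logarithmic axes, colored by `avg_degree`, would give a toy version of the view described for Figure 5.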
