Knowledge Graph in Astronomical Research with Large Language Models: Quantifying Driving Forces in Interdisciplinary Scientific Discovery

Zechang Sun, Yuan-Sen Ting, Yaobo Liang, Nan Duan, Song Huang, Zheng Cai

TL;DR

This work addresses the challenge of quantifying how new ideas and technologies drive interdisciplinary astronomical research. It introduces an LLM‑assisted pipeline to extract concepts from 297,807 astronomy papers (1993–2024), constructs a knowledge graph linked by citation‑reference relevance, and analyzes the co‑evolution of concepts over time. The study finds a two‑phase adoption for numerical simulations and a peripheral but growing integration of machine learning concepts, with an approximate five‑year lag between technique development and impactful scientific use. Overall, the approach provides a quantitative framework to study interdisciplinarity in astronomy and can track how cross‑domain innovations emerge and diffuse across subfields.

Abstract

Identifying and predicting the factors that contribute to the success of interdisciplinary research is crucial for advancing scientific discovery. However, there is a lack of methods to quantify the integration of new ideas and technological advancements in astronomical research and how these new technologies drive further scientific breakthroughs. Large language models, with their ability to extract key concepts from vast literature beyond keyword searches, provide a new tool to quantify such processes. In this study, we extracted concepts in astronomical research from 297,807 publications between 1993 and 2024 using large language models, resulting in a set of 24,939 concepts. These concepts were then used to form a knowledge graph, where the link strength between any two concepts was determined by their relevance through the citation-reference relationships. By calculating this relevance across different time periods, we quantified the impact of numerical simulations and machine learning on astronomical research. The knowledge graph demonstrates two phases of development: a phase where the technology was integrated and another where the technology was explored in scientific discovery. The knowledge graph also reveals that, although machine learning has made considerable inroads into astronomy, there is currently a lack of new concept development at the intersection of AI and astronomy, which may be the bottleneck preventing machine learning from further transforming the field.
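
The citation-based link strength described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact relevance formula, so the counting scheme and the normalization by total count below are assumptions, as are the function name `link_strengths`, the optional `period` filter on the citing paper's year, and the toy data.

```python
from collections import Counter
from itertools import product


def link_strengths(paper_concepts, citations, paper_year=None, period=None):
    """Score concept pairs by how often a citation-reference pair bridges them.

    Assumption: each (citing, cited) pair adds one count to every pair of
    distinct concepts (one from each paper); counts are then divided by the
    total so the scores sum to 1. This is not the paper's exact definition.
    """
    counts = Counter()
    for citing, cited in citations:
        # Optional time slicing: keep citations whose citing paper is in the period.
        if period is not None and paper_year is not None:
            start, end = period
            year = paper_year.get(citing)
            if year is None or not (start <= year <= end):
                continue
        pairs = product(paper_concepts.get(citing, ()), paper_concepts.get(cited, ()))
        for a, b in pairs:
            if a != b:
                counts[frozenset((a, b))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


# Toy data, made up for illustration only.
paper_concepts = {
    "p1": {"dark matter", "n-body simulation"},
    "p2": {"galaxy formation"},
    "p3": {"n-body simulation"},
}
citations = [("p1", "p2"), ("p3", "p2")]
paper_year = {"p1": 1998, "p2": 1995, "p3": 2012}

all_time = link_strengths(paper_concepts, citations)
early = link_strengths(paper_concepts, citations, paper_year, period=(1993, 2000))
```

Concept pairs bridged by more citation-reference pairs receive a higher score, and restricting the citations to a time window gives a per-period version of the same graph.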

Paper Structure

This paper contains 13 sections, 4 equations, and 4 figures.

Figures (4)

  • Figure 1: Schematic plot outlining the knowledge graph construction using large language model agents. The extraction of concepts comprises three main phases: (1) Concept Extraction, where agents construct scientific concepts from documents; (2) Vectorization and Nearest Neighbor Finding, in which concepts are vectorized and grouped by semantic similarity; (3) Concept Merging, where similar concepts are combined to form coarser-grained structures. The connections between concepts are then defined by citation-reference relevance as detailed in Section \ref{subsec:link}, with concepts involved in more citation-reference pairs assigned a higher relevance.
  • Figure 2: Visualization of a knowledge graph of 24,939 concepts, constructed from 297,807 astronomical research papers. Only concepts appearing in more than 20 papers and links with a link strength greater than 0.001 are displayed. Each concept is categorized into one of the following domains: (A) Galaxy Physics, (B) Cosmology & Nongalactic Physics, (C) Earth & Planetary Science, (D) High Energy Astrophysics, (E) Solar & Stellar Physics, (F) Statistics & AI, (G) Numerical Simulation, or (H) Instrumental Design. In the upper panels, we show connections between galaxy physics and other scientific domains. In the lower panel, we highlight the concepts from simulation, statistics, and observational instruments and their respective locations with respect to galaxy physics. Unsurprisingly, the technological concepts are generally more globally spread, as the same techniques can have wide implications for a broad range of topics in astronomy. Machine learning techniques are still at the periphery of the knowledge graph, suggesting that their integration in astronomy is still in its early stages. The interactive version of the knowledge graph is made publicly available at https://astrokg.github.io/.
  • Figure 3: The average linkage for five distinct time periods is used to investigate the temporal integration of technological techniques into scientific research. The middle and lower panels illustrate a consistent increase in the count of concepts, both in terms of scientific concepts (bottom panel) and technical concepts (middle panel). The upper panel shows the total cross-linkage between individual technical domains and scientific concepts, with higher values indicating stronger adoption. The upper panel reveals a two-phase evolution, with an observed latency of approximately five years. The two phases signify the period of development and introduction of new techniques in astronomy and their subsequent adoption by the community (see text for details). Machine learning has begun to reach integration levels comparable to those of numerical simulations seen two decades earlier. However, the number of machine learning concepts in astronomical research has increased only marginally, rising from 152 between 1993 and 2000, to 215 from 2005 to 2010, and reaching 230 between 2015 and 2020.
  • Figure 4: Integration of machine learning in different subfields of astronomy. The integration is defined as the average cross-domain linkage, similar to the top panel of Figure 3. Cosmology and Nongalactic Astrophysics currently lead the application of machine learning in astronomy, followed by Galaxy Physics and Solar & Stellar Physics. The adoption of machine learning concepts in Earth & Planetary Physics and High Energy Astrophysics still lags behind.
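
The display thresholds in Figure 2 and the average cross-domain linkage used in Figures 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout, the choice to count missing links as zero when averaging, and the toy numbers are all hypothetical. The domain letters follow the Figure 2 caption (B for Cosmology & Nongalactic Physics, F for Statistics & AI).

```python
def filter_graph(paper_counts, links, min_papers=20, min_strength=0.001):
    """Keep concepts appearing in more than `min_papers` papers and links
    with strength greater than `min_strength` (the Figure 2 thresholds)."""
    kept = {c for c, n in paper_counts.items() if n > min_papers}
    return {
        (a, b): w
        for (a, b), w in links.items()
        if a in kept and b in kept and w > min_strength
    }


def average_cross_linkage(links, domain_of, tech_domain, science_domain):
    """Mean link strength over all (technical, scientific) concept pairs.

    Assumption: pairs with no recorded link contribute zero to the mean.
    """
    tech = [c for c, d in domain_of.items() if d == tech_domain]
    sci = [c for c, d in domain_of.items() if d == science_domain]
    if not tech or not sci:
        return 0.0
    total = 0.0
    for t in tech:
        for s in sci:
            # Links are undirected here, so look the pair up in both orders.
            total += links.get((t, s), links.get((s, t), 0.0))
    return total / (len(tech) * len(sci))


# Toy data, made up for illustration only.
paper_counts = {
    "random forest": 45,
    "photometric redshift": 120,
    "weak lensing": 300,
    "rare concept": 3,
}
links = {
    ("random forest", "photometric redshift"): 0.004,
    ("random forest", "weak lensing"): 0.0005,
    ("rare concept", "weak lensing"): 0.01,
}
domain_of = {
    "random forest": "F",
    "photometric redshift": "B",
    "weak lensing": "B",
}

visible = filter_graph(paper_counts, links)
integration = average_cross_linkage(links, domain_of, "F", "B")
```

Evaluating `average_cross_linkage` on graphs built from different time windows, or against different scientific domains, gives the kind of per-period and per-subfield comparison that the Figure 3 and Figure 4 captions describe.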