pathfinder: A Semantic Framework for Literature Review and Knowledge Discovery in Astronomy

Kartheik G. Iyer; Mikaeel Yunus; Charles O'Neill; Christine Ye; Alina Hyk; Kiera McCormick; Ioana Ciuca; John F. Wu; Alberto Accomazzi; Simone Astarita; Rishabh Chakrabarty; Jesse Cranney; Anjalie Field; Tirthankar Ghosal; Michele Ginolfi; Marc Huertas-Company; Maja Jablonska; Sandor Kruk; Huiling Liu; Gabriel Marchidan; Rohit Mistry; J. P. Naiman; J. E. G. Peek; Mugdha Polimera; Sergio J. Rodriguez; Kevin Schawinski; Sanjib Sharma; Michael J. Smith; Yuan-Sen Ting; Mike Walmsley

TL;DR

Pathfinder addresses the challenge of rapidly expanding astronomical literature by combining semantic, natural-language querying with LLM-based synthesis over a large ADS/arXiv corpus. It integrates a Retrieval-Augmented Generation (RAG) pipeline with HyDE query expansion, ReAct agents, and a two-stage reranking framework to produce fact-grounded, low-hallucination answers. The framework is evaluated with synthetic benchmarks and a real-world Gold QA dataset, which show improved retrieval quality and informative, context-aware responses. It also provides tools for visualization and mission-impact analysis. By enabling multilingual and audience-tailored outputs and providing a public, open-source platform, Pathfinder aims to democratize access to astronomical knowledge, while documenting its current limitations and outlining paths for future enhancements.

Abstract

The exponential growth of astronomical literature poses significant challenges for researchers navigating and synthesizing general insights or even domain-specific knowledge. We present Pathfinder, a machine learning framework designed to enable literature review and knowledge discovery in astronomy, focusing on semantic searching with natural language instead of syntactic searches with keywords. Utilizing state-of-the-art large language models (LLMs) and a corpus of 350,000 peer-reviewed papers from the Astrophysics Data System (ADS), Pathfinder offers an innovative approach to scientific inquiry and literature exploration. Our framework couples advanced retrieval techniques with LLM-based synthesis to search astronomical literature by semantic context as a complement to currently existing methods that use keywords or citation graphs. It addresses complexities of jargon, named entities, and temporal aspects through time-based and citation-based weighting schemes. We demonstrate the tool's versatility through case studies, showcasing its application in various research scenarios. The system's performance is evaluated using custom benchmarks, including single-paper and multi-paper tasks. Beyond literature review, Pathfinder offers unique capabilities for reformatting answers in ways that are accessible to various audiences (e.g. in a different language or as simplified text), visualizing research landscapes, and tracking the impact of observatories and methodologies. This tool represents a significant advancement in applying AI to astronomical research, aiding researchers at all career stages in navigating modern astronomy literature.

Paper Structure (33 sections, 2 equations, 9 figures, 1 table)

Introduction
Dataset
Building pathfinder
Generation
Generating Embeddings
Text generation with RAG
Text generation with ReAct agents
Retrieval
Semantic search & embeddings
Generating keywords from abstracts
Weighting schemes: Keywords, Timestamps and Citations
Query expansion and HyDE
Reranking
Outliers and consensus
Benchmarks and evaluation
...and 18 more sections

Figures (9)

Figure 1: Schematic showing the overall pathfinder pipeline.
Figure 2: A heatmap showing a 2D UMAP projection of the 1536-dimensional embedding space, illustrating the different areas of the astro-ph literature corpus. The heatmap color denotes the density of papers in different parts of the corpus, with the auto-tagging keywords at various locations shown to illustrate the way the embeddings group the different topics by semantic similarity. Similar to a world map, the axes here do not hold a particular meaning. Regions close to each other are semantically similar, while distant regions are not.
Figure 3: Similar to Figure 2, but showing the loci of the top-level Unified Astronomy Thesaurus (UAT) hierarchical keywords projected into the embedding space. Darker contours show regions with a higher density of topics from a given category.
Figure 4: Top-k retrieved papers for three different example queries, visualized in the two-dimensional UMAP space. Red points are outliers; blue points are non-outliers. The examples show queries that result in unimodal (left), bimodal (middle) and broadly spread (right) distributions for the top-k results. Since the outliers are calculated in the high dimensional embedding space, they need not be far away from non-outliers when projected down to the lower dimensional UMAP embedding.
Figure 5: Normalised single-document and multi-document benchmark scores across methods. Single-document scores are an average of the reciprocal rank and the success rate in retrieving the correct paper in the top 10 documents, normalised so the scores sum to 1. Similarly, the multi-document scores are an average of Normalised Discounted Cumulative Gain (NDCG) at 100 documents and recall at 100 documents, again normalised. A combination of HyDE and reranking (HydeCohereRerank) was the best performing system, outperforming HyDE alone, base semantic search (using only the cosine similarity between query and document embeddings) and a simple bag-of-words system.
...and 4 more figures
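
The Figure 5 caption names four retrieval metrics (reciprocal rank, success rate in the top 10, NDCG at 100 and recall at 100) and a baseline that ranks documents by cosine similarity alone. The following is a minimal sketch of those quantities in plain Python, not the authors' benchmark code. The function names, the toy two-dimensional vectors and the document ids are all hypothetical, NDCG is computed with binary relevance as an assumption (the caption does not say whether relevance is graded), and the final normalisation across methods is not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(query_vec, doc_vecs):
    """Baseline semantic search: doc ids sorted by decreasing similarity."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


def reciprocal_rank(ranked, target):
    """1 / (1-based rank of the target document), or 0 if it is absent."""
    for position, doc in enumerate(ranked, start=1):
        if doc == target:
            return 1.0 / position
    return 0.0


def success_at_k(ranked, target, k=10):
    """1 if the target document appears in the top k results, else 0."""
    return 1.0 if target in ranked[:k] else 0.0


def recall_at_k(ranked, relevant, k=100):
    """Fraction of the relevant documents found in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked[:k])) / len(relevant)


def ndcg_at_k(ranked, relevant, k=100):
    """NDCG with binary relevance (an assumption; see the text above)."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(
        1.0 / math.log2(position + 1)
        for position in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


# Toy example with hypothetical 2D "embeddings" (the real space is 1536-dimensional).
docs = {"A": [1.0, 0.0], "B": [1.0, 1.0], "C": [0.0, 1.0], "D": [-1.0, 0.0]}
ranked = rank_by_cosine([1.0, 0.0], docs)

# Single-document task: one correct paper per query.
single_score = 0.5 * (reciprocal_rank(ranked, "B") + success_at_k(ranked, "B", k=10))

# Multi-document task: a set of relevant papers per query.
multi_score = 0.5 * (ndcg_at_k(ranked, {"A", "C"}, k=100) + recall_at_k(ranked, {"A", "C"}, k=100))
```

In this sketch each per-query score is the unweighted mean of its two metrics, following the word "average" in the caption. Averaging over many queries and normalising across methods would sit on top of these functions.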
