AstroMLab 5: Structured Summaries and Concept Extraction for 400,000 Astrophysics Papers
Yuan-Sen Ting, Alberto Accomazzi, Tirthankar Ghosal, Tuan Dung Nguyen, Rui Pan, Zechang Sun, Tijmen de Haan
TL;DR
AstroMLab 5 introduces a large-scale, knowledge-graph-ready resource for astrophysics: structured six-section summaries and a dense 9,999-concept vocabulary for 408,590 astro-ph papers (1992–July 2025). The pipeline combines OCR-based PDF processing, chunked multi-stage summarization, and embedding-driven concept clustering to produce comprehensive paper representations whose concepts cover the literature more densely and more uniformly than traditional ADS keywords. Embedding analyses show that the concepts attached to a paper are dispersed across semantic space, enabling discovery across diverse topics beyond what abstracts or summaries alone can provide. The dataset supports semantic search, knowledge graph construction, and AI-assisted literature exploration, with public code, data, and embeddings released to the community along with a plan for extrinsic evaluation in downstream systems.
Abstract
We present a dataset of 408,590 astrophysics papers from arXiv (astro-ph), spanning 1992 through July 2025. Each paper has been processed through a multi-stage pipeline to produce: (1) structured summaries organized into six semantic sections (Background, Motivation, Methodology, Results, Interpretation, Implication), and (2) extracted concepts, yielding a vocabulary of 9,999 unique concepts with detailed descriptions. The dataset contains 3.8 million paper-concept associations and includes semantic embeddings for all concepts. Comparison with traditional ADS keywords reveals that the concepts provide denser coverage and a more uniform distribution, while analysis of the embedding space structure demonstrates that concepts are semantically dispersed within papers, enabling discovery through multiple diverse entry points. The concept vocabulary and embeddings are publicly released at https://github.com/tingyuansen/astro-ph_knowledge_graph.
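To make two of the quantities above concrete, the following is a minimal sketch, not the authors' analysis code. It computes the average number of concepts per paper implied by the stated totals, and illustrates one possible way to quantify how "dispersed" a paper's concepts are: the mean pairwise cosine distance between their embeddings. The choice of cosine distance, the function names, and the three-dimensional toy vectors are all assumptions made for illustration; the abstract does not specify the metric used, and the released embeddings are not loaded here.

```python
import math
from itertools import combinations

# Totals stated in the abstract (3.8 million is a rounded figure).
N_PAPERS = 408_590
N_ASSOCIATIONS = 3_800_000
avg_concepts_per_paper = N_ASSOCIATIONS / N_PAPERS  # roughly 9.3


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Average cosine distance over all distinct pairs of vectors.

    Hypothetical dispersion score: values near 0 mean the concepts
    sit close together; larger values mean they are spread out.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (assumed, 3-D for readability): one paper whose
# concepts point in unrelated directions, one whose concepts cluster.
dispersed_paper = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
clustered_paper = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [1.0, 0.0, 0.1]]

dispersed_score = mean_pairwise_distance(dispersed_paper)
clustered_score = mean_pairwise_distance(clustered_paper)
```

In this toy setup the paper with mutually orthogonal concept vectors scores much higher than the tightly clustered one. Under the stated assumptions, a high score is what "multiple diverse entry points" would look like numerically: a query close to any one of the concepts can surface the paper.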
