AstroMLab 5: Structured Summaries and Concept Extraction for 400,000 Astrophysics Papers

Yuan-Sen Ting, Alberto Accomazzi, Tirthankar Ghosal, Tuan Dung Nguyen, Rui Pan, Zechang Sun, Tijmen de Haan

TL;DR

AstroMLab 5 introduces a large-scale knowledge-graph-ready resource for astrophysics by generating structured six-section summaries and a dense 9,999-concept vocabulary for 408,590 astro-ph papers (1992–July 2025). The pipeline combines OCR-based PDF processing, chunked multi-stage summarization, and embedding-driven concept clustering to produce comprehensive paper representations that outperform traditional ADS keywords in coverage and balance. Embedding analyses reveal that concepts are distributed across semantic space, enabling discovery across diverse topics beyond what abstracts or summaries alone can provide. The dataset supports semantic search, knowledge graph construction, and AI-assisted literature exploration, with public code, data, and embeddings released to the community along with a plan for extrinsic evaluation in downstream systems.

Abstract

We present a dataset of 408,590 astrophysics papers from arXiv (astro-ph), spanning 1992 through July 2025. Each paper has been processed through a multi-stage pipeline to produce: (1) structured summaries organized into six semantic sections (Background, Motivation, Methodology, Results, Interpretation, Implication), and (2) concept extraction yielding 9,999 unique concepts with detailed descriptions. The dataset contains 3.8 million paper-concept associations and includes semantic embeddings for all concepts. Comparison with traditional ADS keywords reveals that the concepts provide denser coverage and a more uniform distribution, while analysis of embedding space structure demonstrates that concepts are semantically dispersed within papers, enabling discovery through multiple diverse entry points. Concept vocabulary and embeddings are publicly released at https://github.com/tingyuansen/astro-ph_knowledge_graph.

Paper Structure

This paper contains 19 sections, 1 equation, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Distribution of keywords/concepts per paper (left) and frequency distribution (right). ADS keywords show high sparsity, with many papers having few keywords, while our concepts provide consistent coverage. For ADS keywords, the frequency distribution (right) reveals a large pile-up of overly generic terms and an extended tail of overly specific identifiers, while our concepts maintain a more balanced, intermediate granularity.
  • Figure 2: UMAP projections of concept embeddings (gray symbols), summary embeddings (colored diamonds), and the abstract embedding (gold star) for four representative papers. The faint gray background shows all 9,999 concepts in the vocabulary. Even papers classified as "low dispersion" (bottom row) have concepts spread across distinct semantic regions, showing that abstracts and summaries, unlike concepts, cannot capture the full conceptual diversity present in a paper.
  • Figure 3: Temporal evolution of the concept vocabulary across three decades. (a) Number of new concepts emerging each year (crossing the 5-paper threshold). (b) Cumulative growth of the concept vocabulary. The rapid expansion in the early years reflects foundational concepts entering the vocabulary when arXiv began. A secondary peak in 2007 corresponds to cross-listing policy changes.
  • Figure 4: Evolution of concept co-occurrence in astrophysics. Darker colors indicate stronger co-occurrence. (a) Early period (1992–2003): established domain structure. (b) Recent period (2023–2025): computational domains (Statistics/AI, Numerical Simulation) show increased internal coherence and enhanced cross-domain integration with traditional astrophysical domains, reflecting the field's evolution toward data-intensive research.
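
The per-year emergence count described for Figure 3(a) can be illustrated with a minimal sketch. It assumes that a concept "emerges" in the year its cumulative number of associated papers first reaches the 5-paper threshold; the caption does not spell out the exact rule, so this reading is an assumption. The function names and the two input dictionaries (`paper_years`, `paper_concepts`) are hypothetical and do not reflect the layout of the released dataset.

```python
from collections import defaultdict


def emergence_years(paper_years, paper_concepts, threshold=5):
    """Map each concept to the year its cumulative paper count reaches `threshold`.

    paper_years:    {paper_id: publication year}        (hypothetical input layout)
    paper_concepts: {paper_id: iterable of concept names} (hypothetical input layout)
    Concepts that never reach the threshold are omitted.
    """
    # Collect the publication year of every paper tagged with each concept.
    years_by_concept = defaultdict(list)
    for paper_id, concepts in paper_concepts.items():
        year = paper_years[paper_id]
        for concept in set(concepts):  # count each paper once per concept
            years_by_concept[concept].append(year)

    emerged = {}
    for concept, years in years_by_concept.items():
        if len(years) >= threshold:
            years.sort()
            # The threshold-th earliest paper marks the crossing year.
            emerged[concept] = years[threshold - 1]
    return emerged


def new_concepts_per_year(emerged):
    """Count how many concepts cross the threshold in each year, sorted by year."""
    counts = defaultdict(int)
    for year in emerged.values():
        counts[year] += 1
    return dict(sorted(counts.items()))


def cumulative_vocabulary(per_year):
    """Running total of emerged concepts, analogous to panel (b)."""
    total = 0
    cumulative = {}
    for year, count in sorted(per_year.items()):
        total += count
        cumulative[year] = total
    return cumulative
```

Feeding the output of `emergence_years` into `new_concepts_per_year`, and then into `cumulative_vocabulary`, yields the two series plotted in panels (a) and (b) under the stated assumption.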