Predicting New Concept-Object Associations in Astronomy by Mining the Literature

Jinchu Li, Yuan-Sen Ting, Alberto Accomazzi, Tirthankar Ghosal, Nesar Ramachandra

TL;DR

Results indicate that historical literature encodes predictive structure not captured by global heuristics or local neighborhood voting, suggesting a path toward tools that could help triage follow-up targets for scarce telescope time.

Abstract

We construct a concept-object knowledge graph from the full astro-ph corpus through July 2025. Using an automated pipeline, we extract named astrophysical objects from OCR-processed papers, resolve them to SIMBAD identifiers, and link them to scientific concepts annotated in the source corpus. We then test whether historical graph structure can forecast new concept-object associations before they appear in print. Because the concepts are derived from clustering and therefore overlap semantically, we apply an inference-time concept-similarity smoothing step uniformly to all methods. Across four temporal cutoffs on a physically meaningful subset of concepts, an implicit-feedback matrix factorization model (alternating least squares, ALS) with smoothing outperforms the strongest neighborhood baseline (KNN using text-embedding concept similarity) by 16.8% on NDCG@100 (0.144 vs 0.123) and 19.8% on Recall@100 (0.175 vs 0.146), and exceeds the best recency heuristic by 96% and 88%, respectively. These results indicate that historical literature encodes predictive structure not captured by global heuristics or local neighborhood voting, suggesting a path toward tools that could help triage follow-up targets for scarce telescope time.
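
The forecasting setup in the abstract can be illustrated with a small sketch: split concept-object edges at a temporal cutoff, fit an implicit-feedback ALS model on the observed edges, and rank unseen objects for a query concept. Everything below is an assumption made for illustration, not the authors' pipeline: the toy edges, the names `temporal_split` and `implicit_als`, the dense-matrix formulation, and the hyperparameters (`rank`, `alpha`, `reg`, `iters`). The concept-similarity smoothing step is not included.

```python
import numpy as np

def temporal_split(edges, cutoff):
    """Split (concept, object, year) mentions into observed and new pairs."""
    observed = {(c, o) for c, o, year in edges if year <= cutoff}
    later = {(c, o) for c, o, year in edges if year > cutoff}
    # A pair counts as "new" only if it never appeared up to the cutoff.
    return observed, later - observed

def implicit_als(R, rank=2, alpha=10.0, reg=0.1, iters=15, seed=0):
    """Confidence-weighted ALS on a dense binary matrix R (concepts x objects)."""
    rng = np.random.default_rng(seed)
    n_concepts, n_objects = R.shape
    X = 0.1 * rng.standard_normal((n_concepts, rank))  # concept factors
    Y = 0.1 * rng.standard_normal((n_objects, rank))   # object factors
    C = 1.0 + alpha * R          # higher confidence on observed edges
    ridge = reg * np.eye(rank)   # L2 regularization keeps each solve well-posed
    for _ in range(iters):
        for u in range(n_concepts):      # update concept factors, objects fixed
            A = (Y.T * C[u]) @ Y + ridge
            b = (Y.T * C[u]) @ R[u]
            X[u] = np.linalg.solve(A, b)
        for i in range(n_objects):       # update object factors, concepts fixed
            A = (X.T * C[:, i]) @ X + ridge
            b = (X.T * C[:, i]) @ R[:, i]
            Y[i] = np.linalg.solve(A, b)
    return X, Y

# Hypothetical toy edges: (concept, object, year of mention).
toy_edges = [
    ("c0", "o0", 2001), ("c0", "o1", 2003),
    ("c1", "o2", 2002), ("c1", "o3", 2004),
    ("c2", "o2", 2005),
    ("c2", "o3", 2012),  # first appears after the cutoff: a "new" edge
    ("c0", "o0", 2015),  # repeat of an already observed pair: not new
]
observed, new = temporal_split(toy_edges, cutoff=2010)

concepts = sorted({c for c, _ in observed})
objects = sorted({o for _, o in observed})
c_idx = {c: k for k, c in enumerate(concepts)}
o_idx = {o: k for k, o in enumerate(objects)}

# Binary interaction matrix built from observed edges only.
R = np.zeros((len(concepts), len(objects)))
for c, o in observed:
    R[c_idx[c], o_idx[o]] = 1.0

X, Y = implicit_als(R)

# Rank candidate objects for one query concept, excluding known associations.
q = c_idx["c2"]
scores = X[q] @ Y.T
scores[R[q] > 0] = -np.inf
ranking = [objects[j] for j in np.argsort(-scores)]
print("new edges:", new)
print("ranking for c2:", ranking)
```

A dense matrix keeps the sketch short; at the scale described in the paper (thousands of concepts, on the order of 100,000 objects) a sparse representation would be needed.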

Paper Structure (25 sections, 11 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Pipeline overview. Top left: the astro-ph corpus (408,590 papers) is processed via OCR and LLM extraction to produce 9,999 concepts and raw object mentions, which are resolved through SIMBAD into 100,560 unique celestial objects. Top right: these are combined into a concept--object knowledge graph, split at a temporal cutoff $T$ into observed edges ($\le T$, gray) and new edges ($> T$, purple). Bottom left: ALS approximates the interaction matrix as a product of low-rank concept and object factor matrices. Bottom right: at inference, the model ranks candidate objects for a query concept to predict future associations.
  • Figure 2: Visual overview of baselines and evaluation metrics. Left: how each method ranks candidate objects for a query concept---from top to bottom: Random (shuffled ordering), Popularity (global frequency), RecentPopularity (time-windowed frequency), ConceptKNN-AA (Adamic--Adar neighbor aggregation), ConceptKNN-TextEmb (embedding-based neighbor aggregation), and ALS (dot product of learned latent factors). Right: how the three evaluation metrics score a ranked list; green checkmarks denote correct held-out associations. MRR rewards placing the first correct object high, Recall@$K$ measures coverage in the top $K$, and NDCG@100 assigns position-discounted credit.
  • Figure 3: Radar plot comparing all methods on the physical concept subset with concept smoothing. The four axes correspond to MRR, Recall@10, Recall@100, and NDCG@100, each normalized so the best method equals 1.0. Methods shown: Random (gray dotted), Popularity (red dashed), RecentPopularity with $\Delta\in\{3,5\}$ (orange/yellow dashed), ConceptKNN-AA (teal dash-dot), ConceptKNN-TextEmb (blue dash-dot), and ALS (dark solid). ALS forms the outermost polygon, leading on all four metrics.
  • Figure 4: Radar plot on the physical concept subset without smoothing. ALS leads on Recall@100 and NDCG@100; ConceptKNN-TextEmb is competitive on MRR and Recall@10.
  • Figure 5: Metric trends across cutoff years (Physical subset). Each row shows one metric (top to bottom: MRR, Recall@10, Recall@100, NDCG@100); left column: without smoothing, right column: with smoothing. Lines correspond to ALS (blue), Popularity (orange), RecentPopularity with best $\Delta$ (green), ConceptKNN-AA with best $k$ (red), and ConceptKNN-TextEmb with best $k$ (purple). Smoothing uniformly improves all methods; with smoothing, ALS leads on all metrics at every cutoff.
  • ...and 1 more figure
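
The three evaluation metrics described in the Figure 2 caption can be sketched for a single query concept as follows. This is a minimal illustration using the standard binary-relevance definitions; the function names and the toy ranked list are hypothetical, and the paper's exact conventions (for example, how ties or empty held-out sets are handled) may differ.

```python
import math

def reciprocal_rank(ranked, relevant):
    """1 / position of the first correct object, or 0 if none is retrieved."""
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0

def recall_at_k(ranked, relevant, k):
    """Fraction of held-out objects that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Position-discounted gain, normalized by the best achievable ordering."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def mean_reciprocal_rank(queries):
    """MRR: average reciprocal rank over (ranked, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

# Hypothetical ranked list for one concept; the held-out objects are o7 and o4.
ranked = ["o3", "o7", "o1", "o9", "o4"]
relevant = {"o7", "o4"}

print(reciprocal_rank(ranked, relevant))   # 0.5 (first hit at position 2)
print(recall_at_k(ranked, relevant, 3))    # 0.5 (one of two hits in the top 3)
print(recall_at_k(ranked, relevant, 5))    # 1.0
print(ndcg_at_k(ranked, relevant, 5))
```

Reported scores such as Recall@100 and NDCG@100 would then be averages of these per-concept values over all query concepts at a given cutoff, under the assumption that each concept is weighted equally.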