Determining Research Priorities Using Machine Learning

Brian Thomas, Harley Thronson, Anthony Buonomo, Louis Barbier

TL;DR

This study trains Latent Dirichlet Allocation (LDA) topic models on titles and abstracts from high‑impact astronomy journals (1998–2010), using SingleRank-based key-term extraction and SciSpacy lemmatization to derive 125 topics. It then defines Topic Contribution Score (TCS), TCS_CAGR, and Research Interest (RI) to quantify topic engagement and growth, and tests these metrics against the DS2010 science frontier content and Decadal Survey whitepapers, finding significant cross‑domain correlations. RI generally provides the strongest alignment with human priors, while TCS_CAGR shows a robust association with Mean Lifetime Citation Rate (MLCR), suggesting growth‑oriented topics predict future impact better than current popularity. Despite moderate explanatory power (mean R^2 around 0.4) and topic drift concerns, the results demonstrate practical potential for ML‑based metrics to aid planning (e.g., curated reading lists) and point to future improvements with newer NLP models such as AstroBERT or GPT‑4.
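As a rough illustration of the growth metric named above, the sketch below sums per-document topic weights into a yearly score and then applies the standard compound annual growth rate (CAGR) formula to it. The summation used in `topic_contribution_score`, the function names, and the toy numbers are assumptions made for illustration only. This summary does not give the paper's exact definitions of TCS, TCS_CAGR, or RI.

```python
def topic_contribution_score(doc_topic_weights, topic):
    """Sum one topic's weight over all documents (an ASSUMED form of TCS)."""
    return sum(weights[topic] for weights in doc_topic_weights)


def cagr(first_value, last_value, n_periods):
    """Standard compound annual growth rate over n_periods yearly intervals."""
    if first_value <= 0 or n_periods <= 0:
        raise ValueError("CAGR needs a positive starting value and period count")
    return (last_value / first_value) ** (1.0 / n_periods) - 1.0


# Hypothetical toy corpus: for each year, one list of topic weights per document.
weights_by_year = {
    1998: [[0.7, 0.3], [0.6, 0.4]],
    1999: [[0.5, 0.5], [0.4, 0.6]],
    2000: [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]],
}

years = sorted(weights_by_year)
# Yearly score for topic index 1, then its growth rate from first to last year.
tcs_topic1 = [topic_contribution_score(weights_by_year[y], topic=1) for y in years]
growth_topic1 = cagr(tcs_topic1[0], tcs_topic1[-1], n_periods=len(years) - 1)
```

A growth rate computed this way depends only on the first and last years, so a noisy topic time series could give a misleading value. Whether the authors smooth the series first is not stated here.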

Abstract

We summarize our exploratory investigation into whether Machine Learning (ML) techniques applied to publicly available professional text can substantially augment strategic planning for astronomy. We find that an approach based on Latent Dirichlet Allocation (LDA), using content drawn from astronomy journal papers, can be used to infer high-priority research areas. While the LDA models are challenging to interpret, we find that they may be strongly associated with meaningful keywords and scientific papers, which allows for human interpretation of the topic models. We find significant correlation between the results of applying these models to the previous decade of astronomical research ("1998-2010" corpus) and the contents of the science frontier panel report, which contains the high-priority research areas identified by the 2010 National Academies' Astronomy and Astrophysics Decadal Survey ("DS2010" corpus). Significant correlations also exist between model results for the 1998-2010 corpus and the whitepapers submitted to the Decadal Survey ("whitepapers" corpus). Importantly, we derive predictive metrics from these results that can provide leading indicators of which content modeled by the topic models will become highly cited in the future. Using these metrics and the associations between papers and topic models, it is possible to identify important papers for planners to consider. A preliminary version of our work was presented by Thronson et al. (2021) and Thomas et al. (2022).
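The correlations mentioned above are reported as Spearman correlations in the Figure 5 caption. A minimal, dependency-free sketch of that statistic follows, assuming each topic contributes one value from the literature corpus and one from the comparison corpus. The variable names and the five toy values are hypothetical and are not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-topic values for two corpora (one entry per topic).
literature_metric = [0.12, 0.40, 0.05, 0.33, 0.27]
survey_tcs = [1.0, 3.5, 0.4, 1.8, 2.9]
rho = spearman(literature_metric, survey_tcs)
```

A rank-based statistic is a natural choice here because topic scores from different corpora need not share a scale. In practice a library routine such as `scipy.stats.spearmanr` would also supply a p-value.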

Paper Structure (3 sections, 7 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Topic Modeling Pipeline. The diagram indicates how we create our topic models using refereed journal articles from high-impact journals (Appendix A). Steps A through E indicate key processing points which are described in the text.
  • Figure 2: Gauging stability of derived topic models. The diagram indicates how closely related the topic models are between separate runs of LDA (trained on the 1998-2010 corpus) with different random seeds. Ten runs were performed, and the topics in each run were cross-compared with the topics in every other run using cosine similarity based on the strongly associated terms for each topic. “Mean Similarity,” a gauge of topic stability, was calculated in two steps. First, the best (highest) cosine similarity between a given topic and all topics in another run was found. This was then repeated for all other runs, yielding nine measurements of cosine similarity for the topic, which were averaged to give the mean similarity.
  • Figure 3: Topic Contribution Score Pipeline. This shows the pipeline used to calculate the “Topic Contribution Score” (TCS), the total contribution of a topic to a corpus, illustrated with an example corpus of three documents and four topics. This pipeline is used to determine topic contributions for our various corpora: the 1998-2010 corpus (documents are the titles and abstracts of journal papers), the Decadal Survey (documents are the text blocks found in the science frontier panel chapters 1-4), and the submitted whitepapers (documents are the whole text of the papers). Detailed descriptions of steps A through D appear in the text.
  • Figure 4: Sample Topic Time Series. Three example topic time series for the 1998-2010 corpus are shown, exemplifying common behaviors seen in the population of topic time series.
  • Figure 5: Literature Research Interest compared to TCS for Decadal Survey content. An example plot from one run comparing the 1998-2010 literature metric RI$_{1998-2010}$ with the 2010 Decadal Survey Topic Contribution Score (TCS$_{DS2010}$) by topic. Each red dot represents a topic. The data indicate that a significant, but weak-to-moderate, correlation exists. The mean Spearman correlation over ten runs is $R_{mean} = 0.57 \pm 0.02$. Estimated errors are the same size as or smaller than the symbols.
  • ...and 5 more figures
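
The two-step “Mean Similarity” calculation described in the Figure 2 caption can be sketched as follows. This is a minimal sketch, assuming each topic is represented as a dictionary mapping its strongly associated terms to weights. That representation, the function names, and the three toy runs are assumptions for illustration (the paper uses ten runs and its own term weighting).

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term->weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity(runs, run_index, topic_index):
    """Average, over every other run, of the topic's best match in that run."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    topic = runs[run_index][topic_index]
    # Step 1: best (highest) cosine similarity to any topic in each other run.
    best_matches = [
        max(cosine(topic, other) for other in other_run)
        for r, other_run in enumerate(runs)
        if r != run_index
    ]
    # Step 2: average those best-match values.
    return sum(best_matches) / len(best_matches)


# Hypothetical toy data: three runs, two topics each.
runs = [
    [{"galaxy": 1.0, "cluster": 1.0}, {"planet": 1.0, "orbit": 1.0}],
    [{"planet": 1.0, "orbit": 1.0}, {"galaxy": 1.0, "cluster": 1.0}],
    [{"galaxy": 1.0, "halo": 1.0}, {"planet": 1.0, "transit": 1.0}],
]
stability = mean_similarity(runs, run_index=0, topic_index=0)
```

With ten runs, `best_matches` would hold the nine measurements mentioned in the caption. Matching on the best similarity rather than on topic index matters because LDA does not order topics consistently between runs.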