Identifying Telescope Usage in Astrophysics Publications: A Machine Learning Framework for Institutional Research Management at Observatories

Vicente Amado Olivo, Wolfgang Kerzendorf, Brian Cherinka, Joshua V. Shields, Annie Didier, Katharina von der Wense

TL;DR

This work introduces a machine learning classification framework for the automatic identification of facility usage of observation sections in astrophysics publications and demonstrates robustness when compared to other approaches, considering common metrics and computational complexity.

Abstract

Large scientific institutions, such as the Space Telescope Science Institute, track the usage of their facilities to understand the needs of the research community. Astrophysicists incorporate facility usage data into their scientific publications, embedding this information in plain text. Traditional automatic search queries prove unreliable for accurate tracking because they misidentify facility names in plain text. As automatic search queries fail, researchers are required to manually classify publications for facility usage, which consumes valuable research time. In this work, we introduce a machine learning classification framework for the automatic identification of facility usage of observation sections in astrophysics publications. Our framework identifies sentences containing telescope mission keywords (e.g., Kepler and TESS) in each publication. Subsequently, the identified sentences are transformed using Term Frequency-Inverse Document Frequency (TF-IDF) and classified with a Support Vector Machine (SVM). The classification framework leverages the context surrounding the identified telescope mission keywords to provide relevant information to the classifier. The framework successfully classifies usage of MAST-hosted missions with 92.9% accuracy. Furthermore, our framework demonstrates robustness when compared to other approaches, considering common metrics and computational complexity. The framework's interpretability makes it adaptable for use across observatories and other scientific facilities worldwide.
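
The first two steps described in the abstract (keyword-based sentence selection followed by a TF-IDF transform) can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation: the function names `keyword_sentences` and `tfidf`, the naive sentence splitter, the unsmoothed TF-IDF weighting, and the two example texts are all assumptions made for illustration. The SVM classification step is omitted here; in practice the resulting vectors would be passed to a classifier.

```python
import math
import re
from collections import Counter


def keyword_sentences(text, keywords):
    """Return the sentences of `text` that mention any keyword.

    Matching is a case-insensitive substring test, so 'Kepler' also
    matches 'Keplerian'. The splitter is naive (breaks after . ! ?).
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lowered = [k.lower() for k in keywords]
    return [s for s in sentences if any(k in s.lower() for k in lowered)]


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tfidf(documents):
    """Map each document (a string) to a dict of term -> TF-IDF weight.

    Uses relative term frequency and idf = log(N / df), without smoothing.
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical example texts, written for this sketch only.
paper_a = ("We analyse light curves from the Kepler space telescope. "
           "The data were retrieved from MAST. "
           "Stellar rotation periods are measured.")
paper_b = ("The disc follows Keplerian rotation. "
           "We model the gas dynamics numerically.")

# Step 1: keep only the sentences that mention the mission keyword.
docs = [" ".join(keyword_sentences(p, ["Kepler"])) for p in (paper_a, paper_b)]

# Step 2: turn the kept sentences into TF-IDF vectors.
vectors = tfidf(docs)
```

Both example texts pass the keyword filter, which is the ambiguity the framework is meant to resolve: the surrounding terms (for instance `telescope` versus `disc`) receive non-zero weights and carry the context a classifier can use, while a term shared by every document (here `the`) is weighted zero.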

Paper Structure

This paper contains 15 sections, 6 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: The filtered dataset comprises sentences that include MAST-hosted mission keywords. For example, when filtering publications for the space telescope 'Kepler', we identify three sentences, shown in these two publications. Contextual cues guide the reader or classifier to distinguish that, on the left, one publication references the 'Kepler' space telescope and is a MAST publication, while on the right, the other publication discusses 'Keplerian physics' and the research does not utilize the Kepler space telescope.
  • Figure 2: Our classification framework, presented here, follows a supervised training structure. First we identify relevant sentences in each publication containing mission keywords (e.g., Kepler and TESS) and then we vectorize and classify the text.
  • Figure 3: We present the Receiver Operating Characteristic (ROC) curves displaying the performance of the SVM, Random Forest, and MLP classifiers. Notably, the SVM model trained exclusively on sentences containing mission keywords has an Area Under the Curve (AUC) of 0.97. An AUC of 0.5 indicates that the model's ability to differentiate between the two classes is no better than random chance, while an AUC of 1 indicates that the model distinguishes the two classes perfectly. The red point displays the optimum true-positive and false-positive rates from the classifier in paper C22, with a true-positive rate of 0.93 and a false-positive rate of 0.07. In comparison, the dark green point shows that our SVM model has a higher true-positive rate of 0.96. However, this improvement is accompanied by a slightly higher false-positive rate of 0.1, indicating the inclusion of some irrelevant papers.
  • Figure 4: A confusion matrix delineates the rates of predicted and true labels, enabling the evaluation of true and false predictions for each label in our binary classification (maria_navin_performance_2016). The SVM model exhibits a true-positive rate of 96% in accurately predicting labeled MAST publications and a true-negative rate of 90% in correctly identifying labeled Not MAST publications. In comparison, both the Random Forest model and MLP have lower true-positive rates, while the MLP has a slightly higher true-negative rate.
  • Figure 5: The ROC curve of the SVM model trained on the full-text publications has an Area Under the Curve (AUC) of 0.90 compared to 0.97 when identifying the sections of the publication containing keywords.
  • ...and 1 more figure
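
The rates quoted in the captions for Figures 3 and 4 are related through the confusion matrix: the false-positive rate is one minus the true-negative rate, which is why a 90% true-negative rate corresponds to the false-positive rate of 0.1 on the ROC plot. The sketch below makes that relation explicit. The counts are hypothetical, chosen only so that the resulting rates match the reported 96% and 90%; they are not the paper's actual sample sizes, and the helper name `rates` is an assumption.

```python
def rates(tp, fn, tn, fp):
    """Return (TPR, TNR, FPR) from binary confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of true MAST papers predicted as MAST
    tnr = tn / (tn + fp)  # share of Not MAST papers predicted as Not MAST
    fpr = fp / (fp + tn)  # share of Not MAST papers predicted as MAST
    return tpr, tnr, fpr


# Hypothetical counts (100 papers per class), for illustration only.
tpr, tnr, fpr = rates(tp=96, fn=4, tn=90, fp=10)
```

Because overall accuracy also depends on how many papers fall in each class, these balanced toy counts are not expected to reproduce the accuracy reported in the abstract.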