Large Language Model Driven Analysis of General Coordinates Network (GCN) Circulars

Vidushi Sharma, Ronit Agarwala, Judith L. Racusin, Leo P. Singer, Tyler Barna, Eric Burns, Michael W. Coughlin, Dakota Dutko, Courey Elliott, Rahul Gupta, Ashish Mahabal, Nikhil Mukund

TL;DR

The work demonstrates a practical, open-source pipeline that leverages BERTopic for neural topic modeling, Mistral 7B Instruct for topic summarization and information extraction, and LangChain-based RAG to automate parsing of the GCN Circular archive. It shows that unsupervised and supervised topic clustering can reveal astrophysical themes and multi-messenger activity trends, while zero-shot extraction coupled with retrieval-augmented generation can achieve high accuracy in GRB redshift extraction against Swift data. The approach reduces manual curation, enables scalable text mining, and provides a foundation for real-time, AI-assisted follow-up in transient astronomy, albeit with limitations tied to hardware, prompt engineering, and potential hallucinations. Overall, the paper highlights the viability of LLM-powered, open-source NLP pipelines to enhance the utility of the GCN Circulars for the astronomy community and future multi-messenger alert systems.

Abstract

The General Coordinates Network (GCN) is NASA's time-domain and multi-messenger alert system. GCN distributes two data products, automated "Notices" and human-generated "Circulars," that report the observations of high-energy and multi-messenger astronomical transients. The flexible and unstructured format of GCN Circulars, of which more than 40500 have accumulated over three decades, makes it challenging to manually extract observational information such as redshift or observed wavebands. In this work, we employ large language models (LLMs) to facilitate the automated parsing of transient reports. We develop a neural topic modeling pipeline with open-source tools for the automatic clustering and summarization of astrophysical topics in the Circulars database. Using neural topic modeling and contrastive fine-tuning, we classify Circulars based on their observation wavebands and messengers. Additionally, we separate gravitational wave (GW) event clusters and their electromagnetic (EM) counterparts from the Circulars database. Finally, using the open-source Mistral model, we implement a system to automatically extract gamma-ray burst (GRB) redshift information from the Circulars archive without the need for any training. Evaluation against the manually curated Neil Gehrels Swift Observatory GRB table shows that our simple system, with the help of prompt-tuning, output parsing, and retrieval-augmented generation (RAG), can achieve an accuracy of 97.2% for redshift-containing Circulars. Our neural-search-enhanced RAG pipeline accurately retrieved 96.8% of redshift Circulars from the manually curated database. Our study demonstrates the potential of LLMs to automate and enhance astronomical text mining, and provides a foundation for future advances in transient alert analysis.
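The output-parsing and evaluation steps mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the regular expression, the function names `parse_redshift` and `extraction_accuracy`, the matching tolerance, and the toy responses and reference values are hypothetical and are not taken from the paper or from the Swift GRB table.

```python
import re

# Hypothetical pattern (an assumption, not the paper's parser): matches
# phrasings such as "z = 1.23", "z~2.1", or "redshift of 0.54".
REDSHIFT_PATTERN = re.compile(
    r"(?:\bz|\bredshift(?:\s+of)?)\s*(?:=|~|:)?\s*(\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def parse_redshift(response):
    """Return the first redshift value found in an LLM response, or None."""
    match = REDSHIFT_PATTERN.search(response)
    return float(match.group(1)) if match else None


def extraction_accuracy(responses, reference, tolerance=0.01):
    """Fraction of reference events whose parsed redshift matches the
    reference value within `tolerance` (the tolerance is an assumption)."""
    correct = 0
    for event, truth in reference.items():
        value = parse_redshift(responses.get(event, ""))
        if value is not None and abs(value - truth) <= tolerance:
            correct += 1
    return correct / len(reference)


# Toy data with made-up event names and values, for illustration only.
responses = {
    "event_a": "The measured redshift is z = 1.23.",
    "event_b": "We find a redshift of 0.54 from absorption lines.",
    "event_c": "No redshift reported.",
}
reference = {"event_a": 1.23, "event_b": 0.54, "event_c": 2.1}

accuracy = extraction_accuracy(responses, reference)
```

Here two of the three toy responses yield the reference value, so `accuracy` is 2/3. A regular expression is only one possible parser; a structured-output format requested in the prompt would be another option.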

Paper Structure

This paper contains 27 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Prompt template used for topic summary generation. For every topic cluster, [DOCUMENTS] is replaced by 3 concatenated sample Circulars from that topic, while [KEYWORDS] is replaced by the representative keywords extracted using c-TF-IDF. The prompt template is modified from the default template provided in the BERTopic documentation.
  • Figure 2: Visualization of the GCN topic modeling pipelines, built with the help of the BERTopic library. The blue pipeline depicts the steps for the unsupervised discovery of astrophysical topics. Circulars are embedded with all-MiniLM-L6-v2, after which the vectors are reduced and clustered. Keywords are extracted after stopwords removal and the discovered topics are summarized with Mistral 7B Instruct. The red pipeline describes the supervised process for generating observation-based clusters. The all-MiniLM-L6-v2 model is fine-tuned on a labeled dataset of Circulars. The Circulars and a list of observation labels are then embedded using this model. Finally, both sets of embeddings are compared with each other using the cosine similarity metric to find the closest observation label for each Circular in the embedding space.
  • Figure 3: GCN topic clusters from the unsupervised pipeline after reduction with t-SNE. Topics on the right are represented by their 4 most important keywords, as extracted with c-TF-IDF. The number of Circulars in each topic is given in parentheses next to the topic label. Note that outliers were excluded from the representation. The topics are also listed in Table \ref{tab:summary_of_topics}.
  • Figure 4: (a) t-SNE representation of GCN observation-based clusters. The embedding vector of each Circular is classified using cosine similarity between the embedded labels and documents. Contrastive fine-tuning helps our model classify the Circular embeddings more accurately. (b) Trends over time of the observation-based Circular clusters generated through supervised contrastive fine-tuning of our embedding model (all-MiniLM-L6-v2).
  • Figure 5: (a) t-SNE representation of "Gravitational Wave," "Gravitational Wave Counterpart," and "Not Gravitational Wave" clusters computed using contrastive fine-tuning of the all-MiniLM-L6-v2 model. Green points represent machine-learned GW Circulars, red points represent GW Counterpart Circulars, and blue points represent other Circulars. Yellow diamond points mark Circulars associated with GW 170817, as a representative example of cross-cluster overlap. (b) Topic trends over time for the same three clusters generated using supervised fine-tuning of all-MiniLM-L6-v2. GW 170817 is shown as a yellow hatched histogram.
  • ...and 6 more figures
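
The label-assignment step described in the Figure 2 caption (comparing Circular embeddings with label embeddings by cosine similarity and picking the closest label) can be sketched as follows. This is a minimal illustration under stated assumptions: the 3-dimensional vectors are toy values standing in for real all-MiniLM-L6-v2 embeddings, and the label names, Circular identifiers, and helper functions `cosine_similarity` and `closest_label` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def closest_label(doc_vec, label_vecs):
    """Return the label whose embedding is most similar to the document's."""
    return max(
        label_vecs,
        key=lambda name: cosine_similarity(doc_vec, label_vecs[name]),
    )


# Toy vectors (assumptions): real embeddings would come from the
# fine-tuned sentence-embedding model and have far more dimensions.
label_vecs = {
    "gamma-ray": [1.0, 0.1, 0.0],
    "optical": [0.0, 1.0, 0.2],
    "gravitational wave": [0.1, 0.0, 1.0],
}
circular_vecs = {
    "circular_a": [0.9, 0.2, 0.1],
    "circular_b": [0.1, 0.2, 0.8],
}

assignments = {
    cid: closest_label(vec, label_vecs) for cid, vec in circular_vecs.items()
}
```

With these toy values, `circular_a` is assigned to "gamma-ray" and `circular_b` to "gravitational wave". In practice the same comparison would normally be vectorized over the whole archive rather than looped in pure Python.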