Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics

Arno Simons

TL;DR

Preliminary evaluations indicate that Astro-HEP-BERT's CWEs perform comparably to domain-adapted BERT models trained from scratch on larger datasets for domain-specific word sense disambiguation and induction, as well as for related semantic change analyses. This suggests that retraining general language models for specific scientific domains can be a cost-effective and efficient strategy for HPSS researchers.

Abstract

I present Astro-HEP-BERT, a transformer-based language model designed to generate contextualized word embeddings (CWEs) for studying the meanings of concepts in astrophysics and high-energy physics. Built on a general pretrained BERT model, Astro-HEP-BERT underwent further training over three epochs using the Astro-HEP Corpus, a dataset I curated from 21.84 million paragraphs extracted from more than 600,000 scholarly articles on arXiv, all belonging to at least one of these two scientific domains. The project demonstrates both the effectiveness and feasibility of adapting a bidirectional transformer for applications in the history, philosophy, and sociology of science (HPSS). The entire training process used only freely available code, pretrained weights, and text inputs, and ran on a single MacBook Pro laptop (M2/96GB). Preliminary evaluations indicate that Astro-HEP-BERT's CWEs perform comparably to domain-adapted BERT models trained from scratch on larger datasets for domain-specific word sense disambiguation and induction, as well as for related semantic change analyses. This suggests that retraining general language models for specific scientific domains can be a cost-effective and efficient strategy for HPSS researchers, enabling high performance without the need for extensive training from scratch.

Paper Structure

This paper contains 5 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: The Astro-HEP Corpus: 21.84M paragraphs found in 0.61M articles on astrophysics (ASTRO) and/or high energy physics (HEP) published between 1986 and 2022 on arXiv.
  • Figure 2: Distribution of paragraph length before filtering out short paragraphs (35.38M paragraphs found in 0.61M articles on astrophysics (ASTRO) and/or high energy physics (HEP) published between 1986 and 2022 on arXiv).
  • Figure 3: Distribution of whitespace rate before filtering out paragraphs with a rate of less than 0.1 or more than 0.2 (22.03M paragraphs found in 0.61M articles on astrophysics (ASTRO) and/or high energy physics (HEP) published between 1986 and 2022 on arXiv).
  • Figure 4: Decreasing cross-entropy loss during the extended pretraining of Astro-HEP-BERT.
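
The whitespace-rate filter mentioned in the Figure 3 caption can be illustrated with a minimal sketch. It assumes that "whitespace rate" means the fraction of a paragraph's characters that are whitespace; the exact definition used for the Astro-HEP Corpus is not given here, and the function names and sample paragraphs are hypothetical. Only the thresholds 0.1 and 0.2 come from the caption.

```python
def whitespace_rate(paragraph: str) -> float:
    """Fraction of characters that are whitespace (assumed definition)."""
    if not paragraph:
        return 0.0
    return sum(ch.isspace() for ch in paragraph) / len(paragraph)


def keep_paragraph(paragraph: str, low: float = 0.1, high: float = 0.2) -> bool:
    """Keep a paragraph only if its whitespace rate lies within [low, high]."""
    return low <= whitespace_rate(paragraph) <= high


# Hypothetical sample paragraphs, for illustration only.
samples = [
    "Dark matter halos form at high redshift in this model.",  # ordinary prose
    "x=1;y=2;z=3;E=mc^2;p=mv;F=ma",                            # no whitespace at all
    "a b c d e f g h i j",                                     # mostly whitespace
]

kept = [p for p in samples if keep_paragraph(p)]
print(kept)
```

Under this assumed definition, ordinary English prose tends to fall inside the band, while formula fragments with almost no spaces and token-per-character residue with too many spaces fall outside it.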