ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study

Eric Modesitt, Ke Yang, Spencer Hulsey, Chengxiang Zhai, Volodymyr Kindratenko

TL;DR

ORBIT introduces a scalable data-curation framework that combines embedding-based similarity with a BERT-based educational-value regressor to filter noisy web data into high-quality, domain-specific corpora. Demonstrated on astronomy, it curates a 10-billion-token dataset from a 1.3-trillion-token educational corpus and fine-tunes LLaMA-3-8B on a 1-billion-token astronomy subset, raising MMLU astronomy accuracy from 69% to 76% and yielding a model that GPT-4o prefers over the base model in 73% of 1000 astronomy questions; the approach also generalizes to law and medicine. The key contributions are the two-stage curation pipeline, the cross-domain validation, and the open-source release of datasets, code, and the Orbit model. The work shows that carefully balanced, domain-focused data can yield substantial performance gains with limited computational overhead, enabling more efficient development of specialized AI tools for scientific and professional domains.
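The two-stage idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the tiny embedding table stands in for pretrained word vectors, `toy_score` is a hypothetical placeholder for the BERT-based educational-value regressor, and both thresholds are arbitrary values chosen for the example.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Component-wise average of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed(text: str, table: Dict[str, Vector]) -> Optional[Vector]:
    """Average the vectors of known words; None if no word is in the table."""
    vectors = [table[w] for w in text.lower().split() if w in table]
    return mean_vector(vectors) if vectors else None


def two_stage_filter(
    docs: Sequence[str],
    table: Dict[str, Vector],
    domain_terms: Sequence[str],
    score_fn: Callable[[str], float],
    sim_threshold: float,
    score_threshold: float,
) -> List[str]:
    """Keep documents that pass the similarity stage and then the score stage."""
    domain_vec = mean_vector([table[t] for t in domain_terms])
    kept = []
    for doc in docs:
        vec = embed(doc, table)
        # Stage 1: drop documents that are not close to the domain vector.
        if vec is None or cosine(vec, domain_vec) < sim_threshold:
            continue
        # Stage 2: drop documents with a low educational-value score.
        if score_fn(doc) >= score_threshold:
            kept.append(doc)
    return kept


# Hypothetical 2-dimensional embeddings, for illustration only.
TOY_EMBEDDINGS = {
    "star": [0.9, 0.1],
    "galaxy": [0.8, 0.2],
    "telescope": [0.85, 0.15],
    "recipe": [0.1, 0.9],
    "butter": [0.05, 0.95],
}


def toy_score(doc: str) -> float:
    """Placeholder scorer on a 0-5 scale: longer documents score higher."""
    return min(5.0, len(doc.split()) / 2)


docs = [
    "the telescope resolved a distant galaxy and its brightest star",
    "butter recipe",
    "star galaxy",
    "hello world",
]
kept = two_stage_filter(
    docs, TOY_EMBEDDINGS, ["star", "galaxy"], toy_score,
    sim_threshold=0.8, score_threshold=3.0,
)
print(kept)
```

Running the cheap similarity check first means the more expensive scorer only sees documents that already look on-topic, which is one plausible reading of how such a pipeline keeps curation cost low.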

Abstract

Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because of limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial, high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models. Using astronomy as a primary case study, we refined the 1.3T-token FineWeb-Edu dataset into a high-quality, 10B-token subset focused on astronomy. Fine-tuning LLaMA-3-8B on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69% to 76% and achieved top results on AstroBench, an astronomy-specific benchmark. Moreover, our model (Orbit-LLaMA) outperformed LLaMA-3-8B-base, with GPT-4o evaluations preferring it in 73% of cases across 1000 astronomy-specific questions. Additionally, we validated ORBIT's generalizability by applying it to law and medicine, achieving a significant improvement in data quality compared to an unfiltered baseline. We open-source the ORBIT methodology, including the curated datasets, the codebase, and the resulting model, at https://github.com/ModeEric/ORBIT-Llama.

Paper Structure

This paper contains 56 sections, 18 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comprehensive filtering pipeline from FineWeb-Edu to ORBIT. The pipeline emphasizes the quality and size of the dataset. The orange stage includes common filtering methods formalized in CCNet (Wenzek et al., 2020). The yellow stage summarizes large-scale semantic filters from Raffel et al. (2023). The green stage includes the additional semantic filters and the BERT-based classifier used to filter for educational relevance in FineWeb-Edu. The blue stage outlines our contributions: GloVe-based embedding thresholding and a BERT classifier for educational relevance specific to astronomy. See the dataset curation and educational assessment subsections for details on our contributions.
  • Figure 2: Full Stage 2 pipeline visualized.
  • Figure 3: Distribution of educational value scores (ranging from 0 to 5) assigned by the BERT-based regressor model to a sample of 1000 astronomy-related documents. This visualization demonstrates the validity of the classifier by showing alignment with expected distributions based on held-out test sets and expert evaluations.
  • Figure 4: Average Score vs Percent Kept, comparing different filtering methods: embedding thresholds (fastText, 100d, 300d), keyword filtering, and no filtering. The x-axis is log-scaled for clarity.
  • Figure 5: Distribution of residual components for the domain-specific embeddings ($m = 100$). The residuals exhibit a normal distribution centered near zero, validating that noise diminishes with an increasing number of domain-relevant terms. This result supports the robustness of our astronomy vector in representing domain relevance while minimizing noise.
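
The claim in the Figure 5 caption, that noise diminishes as more domain-relevant terms are averaged, can be illustrated with a small simulation. The additive model below (each term vector equals a shared domain direction plus independent zero-mean Gaussian noise) is an assumption made for this sketch and is not taken from the paper; `residual_rms`, the dimension, and the noise scale are likewise hypothetical.

```python
import math
import random


def residual_rms(m: int, dim: int = 50, sigma: float = 1.0, seed: int = 0) -> float:
    """RMS residual between the mean of m noisy term vectors and the true direction."""
    rng = random.Random(seed)
    direction = [1.0] * dim  # hypothetical "true" domain direction
    sums = [0.0] * dim
    # Accumulate m term vectors, each equal to the direction plus Gaussian noise.
    for _ in range(m):
        for i in range(dim):
            sums[i] += direction[i] + rng.gauss(0.0, sigma)
    # Residual of the averaged vector, component by component.
    residuals = [sums[i] / m - direction[i] for i in range(dim)]
    return math.sqrt(sum(r * r for r in residuals) / dim)


for m in (1, 10, 100):
    print(m, round(residual_rms(m), 3))
```

Under this assumed model the residual components of the averaged vector are zero-mean Gaussian with standard deviation sigma / sqrt(m), so the printed values should shrink roughly by a factor of sqrt(10) at each step.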