ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study
Eric Modesitt, Ke Yang, Spencer Hulsey, Chengxiang Zhai, Volodymyr Kindratenko
TL;DR
ORBIT introduces a scalable data-curation framework that combines embedding-based similarity with a BERT-based educational-value regressor to filter noisy web data into high-quality, domain-specific corpora. Demonstrated on astronomy, it curates a 10-billion-token dataset from the 1.3-trillion-token FineWeb-Edu corpus and fine-tunes LLaMA-3-8B on a 1-billion-token astronomy subset, raising MMLU astronomy accuracy from 69% to 76% and outperforming the base model in GPT-4o-judged comparisons; the approach also generalizes to law and medicine. The key contributions are the two-stage curation pipeline, the cross-domain validation, and the open-source release of the datasets, code, and Orbit model. The work shows that carefully curated, domain-focused data can yield substantial performance gains at limited computational cost, enabling more efficient development of specialized AI tools for scientific and professional domains.
Abstract
Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because their training data contains limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial amounts of high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models. Using astronomy as a primary case study, we refined the 1.3T-token FineWeb-Edu dataset into a high-quality, 10B-token subset focused on astronomy. Fine-tuning \textsc{LLaMA-3-8B} on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69\% to 76\% and achieved top results on AstroBench, an astronomy-specific benchmark. Our model (Orbit-LLaMA) also outperformed \textsc{LLaMA-3-8B-base}, with GPT-4o evaluations preferring it in 73\% of cases across 1000 astronomy-specific questions. Finally, we validated ORBIT's generalizability by applying it to law and medicine, achieving a significant improvement in data quality compared to an unfiltered baseline. We open-source the ORBIT methodology, including the curated datasets, the codebase, and the resulting model, at \href{https://github.com/ModeEric/ORBIT-Llama}{https://github.com/ModeEric/ORBIT-Llama}.
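
To make the two-stage idea concrete, the following is a minimal illustrative sketch, not the released ORBIT code. It assumes a toy bag-of-words vector in place of a learned embedding model and a hypothetical length-based scorer (`toy_quality`) in place of the BERT-based educational-value regressor; the function names, the seed text, and both thresholds are assumptions chosen only for illustration.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_quality(doc: str) -> float:
    """Hypothetical quality score in [0, 1]; a real pipeline would call a trained regressor."""
    return min(len(doc.split()) / 10.0, 1.0)


def two_stage_filter(
    docs: Iterable[str],
    domain_seed: str,
    quality_score: Callable[[str], float] = toy_quality,
    sim_threshold: float = 0.2,      # assumed value, not from the paper
    quality_threshold: float = 0.5,  # assumed value, not from the paper
) -> List[str]:
    """Keep documents that are both on-domain (stage 1) and high quality (stage 2)."""
    seed_vec = embed(domain_seed)
    kept = []
    for doc in docs:
        # Stage 1: drop documents that are not similar enough to the domain seed.
        if cosine(embed(doc), seed_vec) < sim_threshold:
            continue
        # Stage 2: drop on-domain documents that score too low on quality.
        if quality_score(doc) < quality_threshold:
            continue
        kept.append(doc)
    return kept
```

Running the cheap similarity check first means the more expensive quality scorer only sees documents that are already on-domain, which is one plausible way such a pipeline keeps cost down at web scale.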
