GeoGalactica: A Scientific Large Language Model in Geoscience

Zhouhan Lin, Cheng Deng, Le Zhou, Tianhang Zhang, Yi Xu, Yutong Xu, Zhongmou He, Yuanyuan Shi, Beiya Dai, Yunchong Song, Boyi Zeng, Qiyuan Chen, Yuxun Miao, Bo Xue, Shu Wang, Luoyi Fu, Weinan Zhang, Junxian He, Yunqiang Zhu, Xinbing Wang, Chenghu Zhou

TL;DR

GeoGalactica addresses the need for geoscience-specific NLP capable of scientific reasoning by further pre-training the 30B-parameter Galactica model on a 65B-token geoscience corpus and fine-tuning it with 1M domain-focused instructions. The approach integrates a comprehensive data-curation pipeline (GeoCorpus, GeoSignal), domain knowledge resources (GAKG, GSO), and tool-learning capabilities, and is evaluated with GeoBench, MMLU, and extensive human assessments. Results show that GeoGalactica outperforms baselines on geoscience benchmarks and performs strongly on domain-relevant tasks, while leaving room for improvement in cross-domain reasoning and certain physics and medical topics. The work contributes open-source data-processing tools, curated datasets, and a concrete pathway toward unified geoscience foundation models, with implications for research, education, and disaster mitigation.

Abstract

Large language models (LLMs) have achieved huge success thanks to their general knowledge and their ability to solve a wide spectrum of tasks in natural language processing (NLP). Because of these abilities, LLMs have opened up potential interdisciplinary applications that use artificial intelligence to foster scientific discovery in a specific domain (AI for science, AI4S). Meanwhile, NLP techniques are used widely and in many different ways in geoscience research and practice, from knowledge extraction and document classification to question answering and knowledge discovery. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach. We specialize an LLM for geoscience by further pre-training the model on a vast amount of geoscience text and then applying supervised fine-tuning (SFT) to the resulting model with our custom-collected instruction-tuning dataset. These efforts result in GeoGalactica, a model with 30 billion parameters. To the best of our knowledge, it is the largest language model for the geoscience domain. More specifically, GeoGalactica is obtained by further pre-training Galactica. We train GeoGalactica on a geoscience-related text corpus containing 65 billion tokens, which to our knowledge is the largest geoscience-specific text corpus. We then fine-tune the model with 1 million pairs of instruction-tuning data consisting of questions that demand professional geoscience knowledge to answer. In this technical report, we describe in detail all aspects of GeoGalactica, including data collection, data cleaning, base model selection, pre-training, SFT, and evaluation. We open-source our data curation tools and the checkpoints of GeoGalactica from the first 3/4 of pre-training.

Paper Structure

This paper contains 103 sections, 1 equation, 39 figures, and 16 tables.

Figures (39)

  • Figure 1: The overview of the processing, construction, components, and applications of GeoGalactica.
  • Figure 2: Illustration of the progression of geoscience research with the use of cutting-edge AI techniques. The text boxes in PaleTurquoise show the techniques from computer science; the text boxes in Bisque show the research in which geoscientists probably used those techniques for the first time.
  • Figure 3: Text processed for tokenization. A. shows an example of a figure marker; we preserve only the captions. B. shows an example of a table marker; we convert the tables to Markdown. C. shows the tokenization of citations; we replace reference numbers with the titles of the referenced papers to preserve the readability of the text corpus. D. shows an example of the special tokens for formulas.
  • Figure 4: Four platforms that contribute most to our GeoSignal.
  • Figure 5: An example for illustrating the construction of restructured knowledge-intensive instruction data.
  • ...and 34 more figures
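
As a rough illustration of step C in the Figure 3 caption (replacing reference numbers with the titles of the referenced papers), the following is a minimal Python sketch and not the authors' pipeline. The `REFERENCES` mapping, the `replace_citations` helper, and the bracketed `[n]` citation pattern are hypothetical, and the `[START_REF]`/`[END_REF]` wrapper tokens are an assumption borrowed from Galactica's citation markup rather than something stated in this text.

```python
import re

# Hypothetical reference list: citation number -> paper title.
REFERENCES = {
    1: "Galactica: A Large Language Model for Science",
    2: "Attention Is All You Need",
}

# Assumed wrapper tokens, modeled on Galactica's citation markup.
REF_START, REF_END = "[START_REF]", "[END_REF]"

# Assumed citation style: a single number in square brackets, e.g. [2].
CITATION = re.compile(r"\[(\d+)\]")


def replace_citations(text, references):
    """Replace numeric markers such as [1] with the cited paper's title."""

    def substitute(match):
        title = references.get(int(match.group(1)))
        # Leave unknown reference numbers untouched rather than guessing.
        if title is None:
            return match.group(0)
        return f"{REF_START}{title}{REF_END}"

    return CITATION.sub(substitute, text)


if __name__ == "__main__":
    print(replace_citations("Transformers [2] underpin most LLMs.", REFERENCES))
```

Unknown reference numbers are left as-is in this sketch, so a gap in the reference list is never filled with a guessed title.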