K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization

Cheng Deng; Tianhang Zhang; Zhongmou He; Yi Xu; Qiyuan Chen; Yuanyuan Shi; Luoyi Fu; Weinan Zhang; Xinbing Wang; Chenghu Zhou; Zhouhan Lin; Junxian He

TL;DR

This work tackles the scarcity of domain-aware LLMs in geoscience by building K2, a 7B open-source model derived from LLaMA-7B through extensive geoscience-focused pre-training and instruction tuning. It introduces GeoSignal, a geoscience-supervised instruction dataset, and GeoBench, a benchmark for geoscience knowledge understanding and utilization, enabling rigorous evaluation. The authors demonstrate a three-stage domain-adaptation pipeline (further pre-training, general instruction tuning, and expert-alignment with GeoSignal) and show that tool learning enhances external API usage. All resources, including data, weights, and benchmarks, are released to promote reproducibility and further development in geoscience NLP. The results indicate improved performance on both objective tasks and subjective reasoning, highlighting K2’s potential for research assistance and knowledge reasoning in geoscience contexts.

Abstract

Large language models (LLMs) have achieved great success in general domains of natural language processing. In this paper, we bring LLMs to the realm of geoscience with the objective of advancing research and applications in this field. To this end, we present the first-ever LLM in geoscience, K2, alongside a suite of resources developed to further promote LLM research within geoscience. For instance, we have curated the first geoscience instruction tuning dataset, GeoSignal, which aims to align LLM responses to geoscience-related user queries. Additionally, we have established the first geoscience benchmark, GeoBench, to evaluate LLMs in the context of geoscience. In this work, we experiment with a complete recipe to adapt a pre-trained general-domain LLM to the geoscience domain. Specifically, we further train the LLaMA-7B model on a 5.5B-token geoscience text corpus, including over 1 million pieces of geoscience literature, and use GeoSignal's supervised data to fine-tune the model. Moreover, we share a protocol that can efficiently gather domain-specific data and construct domain-supervised data, even in situations where manpower is scarce. We also equip K2 with the ability to use tools, so that it can serve as a naive geoscience aide. Experiments conducted on GeoBench demonstrate the effectiveness of our approach and datasets on geoscience knowledge understanding and utilization. We open-source all the training data and K2 model checkpoints at https://github.com/davendw49/k2.

Paper Structure (29 sections, 1 equation, 7 figures, 10 tables)

Introduction
Related Work
Data Collection and Curation
Pre-training Data
Geoscience Text Corpus Collection
Geoscience Open Access Literatures.
Wikipedia pages about Earth science
Text Corpus Preprocessing
PDF Parsing
Tokenization
Instruction Tuning Data: GeoSignal
Align-to-Human
Align-to-Expert
Evaluation on Expertise in Geoscience: GeoBench
NPEE
...and 14 more sections

Figures (7)

Figure 1: Pipeline for training K2 in two steps: further pre-training to absorb geoscience knowledge, followed by instruction tuning so that the model aligns with humans, follows human instructions, and responds like a human.
Figure 2: Tokenization-processed text. A. shows an example of a figure marker; we preserve only the captions. B. shows an example of a table marker; we convert tables to Markdown. C. shows the tokenization of citations; we replace reference numbers with the referenced papers' titles to preserve the readability of the text corpus. D. shows an example of the special tokens for formulas.
Figure 3: The components of GeoSignal.
Figure 4: An example of GeoSignal restructuring when processing the geoscience website https://rruff.info/.
Figure 5: Training recipe for domain large language models.
...and 2 more figures
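
The citation handling described in the Figure 2 caption (part C) can be illustrated with a minimal Python sketch. It is not the authors' preprocessing code: the `[n]` marker format, the `expand_citations` helper, and the reference titles are all hypothetical, chosen only to show the idea of swapping reference numbers for paper titles.

```python
import re

# Hypothetical reference list (citation number -> title); illustrative only.
REFERENCES = {
    1: "Example Paper on Plate Tectonics",
    2: "Example Paper on Mineral Spectra",
}

# Assumed marker format: a bare number in square brackets, e.g. "[2]".
CITATION = re.compile(r"\[(\d+)\]")


def expand_citations(text, references):
    """Replace numeric citation markers with the cited paper's title."""
    def _sub(match):
        title = references.get(int(match.group(1)))
        # Leave the marker untouched when the number is not in the list.
        return f"({title})" if title is not None else match.group(0)
    return CITATION.sub(_sub, text)


sample = "Spectral data are widely used [2], as are other sources [9]."
print(expand_citations(sample, REFERENCES))
```

Unknown reference numbers are left as-is rather than dropped, so that no citation information is silently lost in this sketch.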
