Classification of Geological Borehole Descriptions Using a Domain Adapted Large Language Model

Hossein Ghorbanfekr, Pieter Jan Kerstens, Katrijn Dirix

TL;DR

This study develops GEOBERTje, a domain-adapted Dutch BERT model trained on 283k unlabeled Flemish borehole descriptions to extract lithology information from unstructured text. It then finetunes three separate classifiers for main, secondary, and tertiary lithologies using a labeled subset (≈2.7k samples), employing class-weighted cross-entropy and postprocessing with a confidence threshold. GEOBERTje significantly outperforms a rule-based approach and GPT-4 across lithology levels, achieving 0.94 accuracy for the main lithology and notable gains for secondary and tertiary classes, with improved performance stemming from domain adaptation. The work highlights the value of domain-specific LLMs for converting large volumes of legacy, unstructured geological data into accurate, machine-readable formats, with implications for subsurface modeling and resource estimation.
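
The two training details named above, class-weighted cross-entropy and confidence-threshold postprocessing, can be illustrated with a minimal sketch. It is not the authors' implementation: the inverse-frequency weighting scheme, the 0.5 threshold, the "none" fallback label, the toy Dutch class names, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights, n_samples / (n_classes * count).

    Assumption: the summary does not say which weighting scheme is used;
    this is the common "balanced" heuristic.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, batch_labels, classes, weights):
    """Weighted mean of -w[y] * log p(y), normalised by the sum of weights."""
    num, den = 0.0, 0.0
    for logits, label in zip(batch_logits, batch_labels):
        probs = softmax(logits)
        w = weights[label]
        num += -w * math.log(probs[classes.index(label)])
        den += w
    return num / den


def predict_with_threshold(logits, classes, threshold=0.5, fallback="none"):
    """Return the top class only if its probability reaches the threshold.

    Assumption: both the threshold value and the fallback label are
    hypothetical placeholders.
    """
    probs = softmax(logits)
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best] if probs[best] >= threshold else fallback


# Toy, imbalanced label set (hypothetical lithology classes).
classes = ["zand", "klei", "leem"]
labels = ["zand"] * 6 + ["klei"] * 3 + ["leem"] * 1
weights = class_weights(labels)

# Rare classes receive larger weights, so their errors cost more.
loss = weighted_cross_entropy(
    [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 0.9]],
    ["zand", "klei", "leem"],
    classes,
    weights,
)

confident = predict_with_threshold([4.0, 0.0, 0.0], classes)
uncertain = predict_with_threshold([0.1, 0.0, 0.0], classes)
```

In this sketch the rare class ("leem") gets the largest weight, and a prediction whose top probability falls below the threshold is replaced by the fallback label rather than forced into a lithology class.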

Abstract

Geological borehole descriptions contain detailed textual information about the composition of the subsurface. However, their unstructured form makes it difficult to extract relevant features into a structured format. This paper introduces GEOBERTje, a domain-adapted large language model trained on Dutch-language geological borehole descriptions from Flanders (Belgium). The model extracts relevant information from the borehole descriptions and represents it in a numeric vector space. As one example application of GEOBERTje, we finetune a classifier on a limited number of manually labeled observations. This classifier assigns each borehole description a main, second and third lithology class. We show that our classifier outperforms both a rule-based approach and OpenAI's GPT-4. This study illustrates how domain-adapted large language models improve the efficiency and accuracy of extracting information from complex, unstructured geological descriptions, which opens new opportunities for geological analysis and modeling using vast amounts of data.

Paper Structure (21 sections, 1 equation, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Spatial distribution of the sampled borehole descriptions in Flanders and Brussels.
  • Figure 2: Occurrence of lithology classes according to the rule-based script.
  • Figure 3: Diagram depicting the two-stage training workflow of GEOBERTje for the lithology classification task, utilizing both unlabeled (stage 1) and labeled data (stage 2).
  • Figure 4: GEOBERTje domain adaptation training (red) and validation (blue) loss curves over epochs. Inset figure: learning rate decay as a function of epoch.
  • Figure 5: GEOBERTje fine-tuning training (red) and validation (blue) loss curves for the main, second and third lithology classes.
  • ...and 5 more figures