Information Extraction from Historical Well Records Using A Large Language Model

Zhiwei Ma, Javier E. Santo, Greg Lackey, Hari Viswanathan, Daniel O'Malley

TL;DR

This paper tackles the problem of extracting vital information about orphaned wells from historical, unstructured documents. It presents an end-to-end workflow that combines OCR text extraction with prompting of open-source large language models, notably Llama 2, to extract well location and true vertical depth (TVD) and return them as structured data. The study shows that, with carefully designed prompts and larger models, location extraction can reach near-perfect accuracy on clean PDFs (up to 100% on the Colorado data). Depth extraction benefits from more complex prompts but remains challenging for image-based records, while GPT-3.5 delivers 100% accuracy on both datasets. The work highlights practical benefits for automated digitization and risk reduction in oil and gas operations, and discusses the key challenges to scaling the approach in real-world settings: OCR quality, hardware demands, and the potential of multi-modal or fine-tuned models.

Abstract

To reduce environmental risks and impacts from orphaned wells (abandoned oil and gas wells), it is essential to first locate and then plug these wells. Although some historical documents are available, they are often unstructured, uncleaned, and outdated. Additionally, they vary widely by state and type. Manually reading and digitizing this information from historical documents is not feasible, given the high number of wells. Here, we propose a new computational approach for rapidly and cost-effectively locating these wells. Specifically, we leverage the advanced capabilities of large language models (LLMs) to extract vital information, including well location and depth, from historical records of orphaned wells. In this paper, we present an information extraction workflow based on open-source Llama 2 models and test it on a dataset of 160 well documents. Our results show that the developed workflow achieves excellent accuracy in extracting location and depth from clean, PDF-based reports, with a 100% accuracy rate. However, it struggles with unstructured image-based well records, where accuracy drops to 70%. The workflow provides significant benefits over manual human digitization, including reduced labor and increased automation. In general, more detailed prompting leads to improved information extraction, and LLMs with more parameters typically perform better. We provide a detailed discussion of the current challenges and the corresponding opportunities and approaches to address them. Additionally, a vast amount of geoscientific information is locked up in old documents, and this work demonstrates that recent breakthroughs in LLMs enable us to unlock this information more broadly.
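The prompting and structured-output steps of such a workflow can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the JSON keys `location` and `depth_ft` (including the assumption that depth is reported in feet), and the helper names `build_prompt`, `parse_reply`, and `accuracy` are all hypothetical. The OCR step and the model call are left out; `record_text` stands for text already extracted from a well record, and `reply` for whatever string the LLM returns.

```python
import json
import re

# Hypothetical instruction text; the paper's actual prompts are not reproduced here.
INSTRUCTIONS = (
    "Extract the well location and true vertical depth from the well record below. "
    'Answer with a single JSON object with the keys "location" and "depth_ft". '
    "Use null for any value that is not stated in the record."
)


def build_prompt(record_text: str) -> str:
    """Combine the fixed instructions with the OCR text of one well record."""
    return INSTRUCTIONS + "\n\nWell record:\n" + record_text.strip()


def parse_reply(reply: str) -> dict:
    """Pull a JSON object out of a model reply that may contain extra prose.

    Returns None for any field that is missing or cannot be parsed.
    """
    result = {"location": None, "depth_ft": None}
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return result
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return result

    location = data.get("location")
    if isinstance(location, str) and location.strip():
        result["location"] = location.strip()

    depth = data.get("depth_ft")
    if isinstance(depth, (int, float, str)) and not isinstance(depth, bool):
        try:
            # Tolerate thousands separators such as "7,250".
            result["depth_ft"] = float(str(depth).replace(",", ""))
        except ValueError:
            pass
    return result


def accuracy(predicted: list, expected: list) -> float:
    """Fraction of records whose extracted value exactly matches the reference."""
    if not expected:
        return 0.0
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / len(expected)
```

Parsing defensively matters here because a chat-style model often wraps the requested JSON in extra sentences, and a failed parse is better recorded as a missing value than as a crash partway through a batch of records.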

Paper Structure (18 sections, 2 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: The proposed workflow for well information extraction via LLM.
  • Figure 2: An illustration of the LLM's inputs and outputs. The figure shows only the structure of the model's input and output: specific well record text must be supplied in the model input section, and the LLM then generates the corresponding output giving the well location and depth.
  • Figure 3: Examples of well records used in this study.
  • Figure 4: Part of the text extracted from the two well records shown in Figure 3.
  • Figure 5: An example of information extraction output using Llama 2 70B.
  • ...and 1 more figures