MiningGPT -- A Domain-Specific Large Language Model for the Mining Industry

Kurukulasooriya Fernando and Gianluca Demartini

TL;DR

The paper tackles the domain-knowledge gap in general-purpose LLMs for the mining industry and presents MiningGPT, a 7B instruction-following model fine-tuned with QLoRA on a mining-domain QA dataset. It introduces MiningPile, an open, mining-focused corpus constructed via keyword filtering, embedding-based refinement, and thesis-reports extraction, illustrating that open data can suffice for domain adaptation. Empirical results show MiningGPT achieves a 14% mining-domain performance gain over Mistral-7B-Instruct and outperforms similar-size open-source baselines, while largely preserving general-domain capabilities. The study demonstrates a practical, cost-effective path to domain-specific AI in industry through adapter-based fine-tuning and curated data, with open data resources and a scalable workflow enabling broader adoption.
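The keyword-filtering stage mentioned above can be illustrated with a minimal sketch. The keyword list, the `min_hits` threshold, and the function names are all hypothetical, chosen only for illustration; the paper's actual keywords and thresholds are not stated in this summary.

```python
import re

# Hypothetical keyword list for illustration only; not taken from the paper.
MINING_KEYWORDS = ("mining", "ore", "drill", "blasting", "tailings")


def keyword_score(text, keywords=MINING_KEYWORDS):
    """Count whole-word keyword occurrences in text, case-insensitively."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(tokens.count(k) for k in keywords)


def filter_documents(docs, min_hits=2):
    """Keep documents with at least min_hits keyword occurrences (assumed threshold)."""
    return [d for d in docs if keyword_score(d) >= min_hits]


docs = [
    "Blasting patterns affect ore fragmentation in open-pit mining.",
    "The recipe calls for two cups of flour and one egg.",
    "Data mining finds patterns in large databases.",
]
kept = filter_documents(docs)
print(kept)
```

With these assumed settings only the first document is kept. The third document still matches the keyword "mining" once, which hints at why a keyword pass alone can admit off-topic text and why an embedding-based refinement step is a plausible follow-up.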

Abstract

Recent advances in generative large language models (LLMs) have produced human-like language capabilities, but these models still lack domain-specific understanding. The research community has therefore begun developing domain-specific LLMs for many domains. In this work we discuss how to build mining domain-specific LLMs, motivated by the global mining industry's significant contribution to the worldwide economy. We report on MiningGPT, a mining domain-specific, instruction-following 7B-parameter LLM that scored 14% higher on a mining domain knowledge test than its parent model, Mistral 7B Instruct.

Paper Structure

This paper contains 25 sections, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Clustering of the sentence embedding of the reference knowledge dataset.
  • Figure 2: MiningGPT domain knowledge evaluation.
  • Figure 3: MiningGPT domain knowledge evaluation.
  • Figure 4: MiningGPT logical reasoning capability evaluation.
  • Figure 5: MiningGPT general knowledge with common sense reasoning evaluation.
  • ...and 4 more figures