MiningGPT -- A Domain-Specific Large Language Model for the Mining Industry
Kurukulasooriya Fernando and Gianluca Demartini
TL;DR
The paper tackles the domain-knowledge gap of general-purpose LLMs in the mining industry and presents MiningGPT, a 7B instruction-following model fine-tuned with QLoRA on a mining-domain QA dataset. It introduces MiningPile, an open, mining-focused corpus constructed via keyword filtering, embedding-based refinement, and extraction from thesis reports, illustrating that open data can suffice for domain adaptation. Empirically, MiningGPT scores 14% higher than Mistral-7B-Instruct on a mining-domain knowledge test and outperforms open-source baselines of similar size, while largely preserving general-domain capabilities. The study demonstrates a practical, cost-effective path to domain-specific AI in industry through adapter-based fine-tuning and curated data, with open data resources and a scalable workflow that support broader adoption.
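As a rough illustration of the two-stage corpus filtering idea mentioned above (keyword filtering followed by embedding-based refinement), the following is a minimal sketch, not the authors' pipeline. The keyword set, the reference text, the similarity threshold, and the bag-of-words vector used as a stand-in for a real embedding model are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical seed vocabulary; the actual MiningPile keyword list is not given here.
MINING_KEYWORDS = {"ore", "mine", "mining", "drilling", "blasting", "tailings"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_filter(docs, keywords, min_hits=1):
    """Stage 1: keep documents with at least `min_hits` keyword tokens."""
    kept = []
    for doc in docs:
        hits = sum(1 for tok in tokenize(doc) if tok in keywords)
        if hits >= min_hits:
            kept.append(doc)
    return kept


def embed(text):
    """Stand-in embedding (assumption): a bag-of-words count vector.
    A real pipeline would call a sentence-embedding model here."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def refine(docs, reference, threshold=0.2):
    """Stage 2: keep documents whose vector is close to a domain reference."""
    ref_vec = embed(reference)
    return [d for d in docs if cosine(embed(d), ref_vec) >= threshold]


docs = [
    "Open pit mining uses drilling and blasting to extract ore.",
    "Data mining finds patterns in large customer databases.",
    "The recipe calls for two cups of flour.",
]
# Hypothetical reference text describing the target domain.
reference = "ore extraction by drilling and blasting in an open pit mine"

candidates = keyword_filter(docs, MINING_KEYWORDS)  # keyword stage
corpus = refine(candidates, reference)              # similarity stage
```

In this toy run the keyword stage alone lets the "data mining" sentence through because it contains the token "mining", and the similarity stage then removes it. That is the kind of false positive a refinement step is meant to catch.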
Abstract
Recent generative Large Language Models (LLMs) exhibit human-like language capabilities but lack domain-specific understanding. The research community has therefore begun developing domain-specific LLMs for many domains. In this work we discuss how to build LLMs specific to the mining domain, as the global mining industry contributes significantly to the worldwide economy. We report on MiningGPT, a mining domain-specific, instruction-following, 7B-parameter LLM that achieved a 14% higher score on a mining-domain knowledge test than its parent model, Mistral-7B-Instruct.
