Leveraging Large Language Models for Generating Labeled Mineral Site Record Linkage Data

Jiyoon Pyo, Yao-Yi Chiang

TL;DR

This work addresses mineral site record linkage across heterogeneous MRDS and USMIN datasets, where missing values and varying spatial representations complicate linking records that refer to the same deposit. It introduces a hybrid pipeline that first uses an LLM to generate labeled training data and then fine-tunes a RoBERTa classifier on serialized record pairs to perform match/non-match classification efficiently. The method achieves substantial gains in macro-F1 compared with ground-truth trained baselines and reduces inference time by about 18x relative to using LLMs alone, enabling scalable nationwide linkage. The work also presents an automated pipeline and discusses future directions for integrating spatial semantics and further data generation to improve robustness.

Abstract

Record linkage integrates diverse data sources by identifying records that refer to the same entity. In the context of mineral site records, accurate record linkage is crucial for identifying and mapping mineral deposits. Properly linking records that refer to the same mineral deposit helps define the spatial coverage of mineral areas, benefiting resource identification and site data archiving. Mineral site record linkage falls under the spatial record linkage category, since the records contain both physical locations and non-spatial attributes in a tabular format. The task is particularly challenging due to the heterogeneity and vast scale of the data. While prior research applies pre-trained discriminative language models (PLMs) to spatial entity linkage, these models often require substantial amounts of curated ground-truth data for fine-tuning. Gathering and creating ground-truth data is both time-consuming and costly, so such approaches are not always feasible in real-world scenarios where gold-standard data are unavailable. Although large generative language models (LLMs) have shown promising results in various natural language processing tasks, including record linkage, their high inference time and resource demands present challenges. We propose a method that uses an LLM to generate training data and then fine-tunes a PLM on that data, addressing the training data gap while preserving the efficiency of PLMs. Our approach achieves over 45% improvement in F1 score for record linkage compared to traditional PLM-based methods using ground-truth data, while reducing the inference time by nearly 18 times compared to relying on LLMs. Additionally, we offer an automated pipeline that eliminates the need for human intervention, highlighting this approach's potential to overcome record linkage challenges.
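The TL;DR mentions that the classifier is fine-tuned on serialized record pairs. This section does not give the serialization format, so the following is only a minimal sketch of one common way to flatten two tabular records into a single text input. The `[COL]`/`[VAL]`/`[SEP]` markers, the function names, the field names, and the sample values are all assumptions made for illustration, not the authors' scheme. Blank attributes are skipped, since the records are described as having many missing values.

```python
from typing import Mapping, Optional


def serialize_record(record: Mapping[str, Optional[object]]) -> str:
    """Flatten one tabular record into text, skipping missing or blank values."""
    parts = []
    for key, value in record.items():
        if value is None or str(value).strip() == "":
            continue  # missing attribute: leave it out of the text entirely
        parts.append(f"[COL] {key} [VAL] {str(value).strip()}")
    return " ".join(parts)


def serialize_pair(
    left: Mapping[str, Optional[object]],
    right: Mapping[str, Optional[object]],
    sep: str = " [SEP] ",
) -> str:
    """Join two serialized records into one match/non-match classifier input."""
    return serialize_record(left) + sep + serialize_record(right)


# Hypothetical records; the field names and values are invented for this sketch.
record_a = {"site_name": "Example Mine", "commodity": "nickel", "latitude": None}
record_b = {"site_name": "Example", "commodity": "Ni, Cu", "latitude": 46.5}

pair_text = serialize_pair(record_a, record_b)
print(pair_text)
```

In a setup like the one described, each such string would be paired with an LLM-generated match or non-match label to form one training example for the PLM.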

Paper Structure

This paper contains 26 sections, 5 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Illustration of the mineral site record linkage process. The pipeline must accurately link records despite variations in recorded information and style. Green highlights records that should be linked, while red indicates records that should not be linked.
  • Figure 2: Image of Eagle Mine, where each color represents a mineral site record from a different database [mrds, usmin, eagle1, eagle2]. The actual polygon region of Eagle Mine is highlighted in light purple [osm].
  • Figure 3: Image of General Washington Placer and Henderson Mine, with mineral records in different colors to represent distinct mineral sites.
  • Figure 4: Record-to-record distance distribution of matched pairs from OSM-FSQ/OSM-Yelp compared with mineral site data. While the OSM-FSQ and OSM-Yelp distances fall within a 2.5-kilometer range, the distance between matched mineral site records can reach 35 kilometers, with a similar distribution across the bins. This highlights a limitation of spatial record linkage methods that rely on empirically defined distance thresholds: such approaches may not suit domains with large variance in spatial distance, like mineral sites.
  • Figure 5: Sample of MRDS Data displaying the heterogeneity of the data. Some of the attributes are left blank, and some attributes consist purely of unique values.
  • ...and 6 more figures
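The Figure 4 caption argues that a fixed distance threshold does not transfer to mineral sites. The sketch below illustrates that point with a standard haversine distance and a hypothetical threshold filter. The coordinates and the function names are assumptions chosen for illustration; only the 2.5-kilometer and 35-kilometer figures come from the caption.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_threshold(a: tuple, b: tuple, max_km: float) -> bool:
    """Hypothetical blocking rule: keep a pair only if it is closer than max_km."""
    return haversine_km(a[0], a[1], b[0], b[1]) <= max_km


# Two invented coordinates 0.2 degrees of latitude apart (roughly 22 km).
site_a = (40.0, -105.0)
site_b = (40.2, -105.0)

distance = haversine_km(*site_a, *site_b)
kept_at_2_5_km = within_threshold(site_a, site_b, max_km=2.5)
kept_at_35_km = within_threshold(site_a, site_b, max_km=35.0)
print(f"{distance:.1f} km, kept at 2.5 km: {kept_at_2_5_km}, kept at 35 km: {kept_at_35_km}")
```

A pair like this would be discarded by a threshold tuned on point-of-interest data, yet it falls well inside the range the caption reports for matched mineral site records. Widening the threshold to 35 kilometers would instead admit many unrelated neighbors.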