Geological Inference from Textual Data using Word Embeddings

Nanmanas Linphrachaya, Irving Gómez-Méndez, Adil Siripatana

TL;DR

The paper tackles locating geological resources, notably lithium, by mining domain-specific texts with NLP, using GloVe embeddings trained on British Columbia Geology data. It evaluates four dimensionality reduction methods (PCA, Autoencoder, VAE, and VAE-LSTM) that map high-dimensional embeddings into a 2D latent space. City names are ranked by cosine similarity to the target keyword, and the top-ranked cities are validated against known lithium mines using haversine distance. Results show that non-linear approaches, especially the Autoencoder, give the strongest spatial alignment with actual deposits, outperforming linear PCA and the other non-linear variants in this setting. The study demonstrates that combining geoscience text mining with dimensionality reduction can yield meaningful geospatial insights, and it suggests avenues for refinement, such as disambiguating city names and integrating additional geographic cues to broaden applicability.

Abstract

This research explores the use of Natural Language Processing (NLP) techniques to locate geological resources, with a specific focus on industrial minerals. Using word embeddings trained with the GloVe model, we extract semantic relationships between target keywords and a corpus of geological texts. The text is filtered to retain only words with geographical significance, such as city names, which are then ranked by their cosine similarity to the target keyword. Dimensionality reduction techniques, including Principal Component Analysis (PCA), Autoencoder, Variational Autoencoder (VAE), and VAE with Long Short-Term Memory (VAE-LSTM), are applied to enhance feature extraction and improve the accuracy of semantic relations. For benchmarking, we use the haversine equation to calculate the proximity between the ten cities most semantically related to the target keyword and identified mine locations. The results demonstrate that combining NLP with dimensionality reduction techniques provides meaningful insights into the spatial distribution of natural resources. Although the predicted cities fall in the same region as the expected locations, the accuracy leaves room for improvement.
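
The ranking step described in the abstract (keep only city names, then order them by cosine similarity to the target keyword) can be illustrated with a minimal sketch. The function names, the `top_k` parameter, and the toy 3-dimensional vectors below are assumptions made for illustration; they are not the authors' code or data, and real GloVe vectors would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def rank_cities(keyword, embeddings, city_names, top_k=10):
    """Return up to top_k (city, score) pairs, most similar first."""
    target = embeddings[keyword]
    scored = [
        (city, cosine_similarity(target, embeddings[city]))
        for city in city_names
        if city in embeddings and city != keyword
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors invented for illustration only (hypothetical city names).
toy_embeddings = {
    "lithium": [1.0, 0.0, 0.0],
    "city_a": [0.9, 0.1, 0.0],
    "city_b": [0.0, 1.0, 0.0],
    "city_c": [-1.0, 0.0, 0.0],
}
ranking = rank_cities(
    "lithium", toy_embeddings, ["city_a", "city_b", "city_c"], top_k=2
)
```

In this toy setup `city_a` points in nearly the same direction as the keyword and so ranks first, while `city_c` points the opposite way and is dropped by the `top_k=2` cut-off.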

Paper Structure

This paper contains 27 sections, 4 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Map of lithium mines in the world, surveyed by British Geological Survey (BGS).
  • Figure 2: Overview of the methodology. The text is pre-processed into tokens, which are used to train GloVe so that each word is represented by an embedding. The embeddings are then transformed with dimensionality reduction techniques to filter out insignificant features. Cosine similarity between the keyword and every other word yields scores that identify the location-related words used to predict the location of the selected keyword.
  • Figure 3: Graphical representation of the Autoencoder.
  • Figure 4: Comparison of five settings (No Reduction, PCA, Autoencoder, VAE, and VAE-LSTM) evaluated with the haversine benchmark. Each setting is shown as a world map projection of the cities selected by cosine similarity to the keyword "lithium." Cities are plotted as blue dots and actual lithium mine locations as red dots, which serve as the reference for spatial accuracy. The RMSE error table (tab:rmse_error) summarizes the prediction error for each technique as a quantitative measure of distance accuracy.
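
As a rough illustration of the haversine benchmark mentioned in the Figure 4 caption, the sketch below computes great-circle distances between candidate cities and mine locations. The Earth radius constant, the function names, and the aggregation (root mean square of each city's distance to its nearest mine) are assumptions; the summary does not state how the paper aggregates distances into its reported error.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest_mine_rmse(cities, mines):
    """RMS of each city's distance to its nearest mine (assumed aggregation)."""
    nearest = [
        min(haversine_km(c[0], c[1], m[0], m[1]) for m in mines)
        for c in cities
    ]
    return math.sqrt(sum(d * d for d in nearest) / len(nearest))


# Hypothetical coordinates chosen only to make the arithmetic easy to check.
example_cities = [(0.0, 0.0), (0.0, 90.0)]
example_mines = [(0.0, 0.0)]
example_error = nearest_mine_rmse(example_cities, example_mines)
```

A lower value means the selected cities sit closer to known mines, which is the sense in which one reduction method can be said to align better than another.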