Predicting the Geolocation of Tweets Using transformer models on Customized Data

Kateryna Lutsai; Christoph H. Lampert

TL;DR

The paper tackles tweet geolocation by finetuning multilingual BERT models to output either direct coordinates or Gaussian Mixture Model parameters, enabling both point estimates and probabilistic location distributions. It introduces a multitask wrapper framework that uses text+user metadata as a Key Feature set and place metadata as a Minor Feature, trained with custom losses that combine geospatial distance and negative log-likelihood for GMMs. The proposed Probabilistic Multiple Outcomes Prediction (PMOP) approach with 5 outcomes achieves strong worldwide performance, while a lower-bound covariance constraint on the GMM helps stabilize training and improve metrics; ablation shows robust country-level results and language-driven effects. The method offers a flexible, scalable, and privacy-conscious route to geotagging large-scale textual data, with a practical plug-and-play setup and potential extensions to other base models and geographic granularities.

Abstract

This research aims to solve the tweet/user geolocation prediction task and to provide a flexible methodology for geotagging textual big data. The suggested approach uses neural networks for natural language processing (NLP) to estimate the location as coordinate pairs (longitude, latitude) and as two-dimensional Gaussian Mixture Models (GMMs). The proposed models were finetuned on a Twitter dataset using pretrained Bidirectional Encoder Representations from Transformers (BERT) as base models. Performance metrics show a median error of less than 30 km on the worldwide-level dataset and less than 15 km on the US-level dataset for models trained and evaluated on text features of tweets' content and metadata context. Our source code and data are available at https://github.com/K4TEL/geo-twitter.git
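As an illustration of the headline metric, the sketch below computes a median error in kilometers from predicted and true coordinate pairs. It assumes the error is a great-circle (haversine) distance on a sphere of mean Earth radius; the abstract does not say which distance the authors use, and the names `haversine_km` and `median_error_km` are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def median_error_km(predicted, actual):
    """Median of per-sample distances; both inputs are lists of (lon, lat) pairs."""
    errors = [haversine_km(*p, *a) for p, a in zip(predicted, actual)]
    return median(errors)
```

The median is less sensitive than the mean to a few predictions that land on the wrong continent, which is a common reason to report it for geolocation tasks.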

Paper Structure (13 sections, 20 equations, 7 figures, 5 tables)

Introduction
Related works
Materials and Methods
Data preprocessing
Model architecture
Results
Worldwide evaluation metrics by model type
Ablation study metrics
Discussion and Conclusion
Acknowledgments
Per-user geolocation estimation
Performance metrics
Geospatial metrics

Figures (7)

Figure 1: Model finetuning procedure flowchart for the best of the proposed models, which has the Probabilistic Multiple Outcomes Prediction (PMOP) output type and individual wrapper layers for the necessary Key Feature (KF) and the optional Minor Feature (MF).
Step 1: parameterization of the model's architecture, outputs, and finetuning;
Step 2: importing the default tokenizer and the base BERT \cite{devlin2018bert} model;
Step 3: reading dataset files from the Twitter archive into a virtual sheet;
Step 4: forming the inputs NON-GEO as KF and GEO-ONLY as MF according to Section \ref{sec:data-preprocess};
Step 5: tokenization of the text inputs with the standard BERT tokenizer;
Step 6: forming the model data loaders by selecting batch subsets from the data;
Step 7: loading the input features into the model, each into a separate wrapper layer;
Step 8: text processing by the local base BERT model before the wrapper layers are applied;
Step 9: returning model outputs of the size defined in Table \ref{tab:model-type-outputs};
Step 10: comparing the processed predictions with the labels according to Section \ref{sec:loss-functions};
Step 11: backpropagation of the total per-batch loss to the local base BERT model.
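Steps 4 and 6 of this flowchart can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the record fields `user`, `text`, `place`, `lon`, and `lat` are hypothetical, as is the choice to join user metadata and tweet text with a space; the actual schema and concatenation are defined in the paper's data preprocessing section and source code. Tokenization (Step 5) is omitted.

```python
def build_inputs(tweet):
    """Split one tweet record into Key Feature (NON-GEO) and Minor Feature (GEO-ONLY) text.

    Field names are hypothetical placeholders, not the dataset's real schema.
    """
    # KF: user metadata plus tweet text; missing fields are skipped.
    key_feature = " ".join(filter(None, [tweet.get("user"), tweet.get("text")]))
    # MF: place metadata; optional, so it falls back to an empty string.
    minor_feature = tweet.get("place") or ""
    return key_feature, minor_feature


def batches(records, batch_size):
    """Yield (KF texts, MF texts, (lon, lat) labels) for consecutive batch subsets."""
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        kf, mf = zip(*(build_inputs(t) for t in chunk))
        labels = [(t["lon"], t["lat"]) for t in chunk]
        yield list(kf), list(mf), labels
```

Keeping KF and MF as separate lists mirrors the flowchart's idea that each feature is fed to its own wrapper layer on top of a shared base model.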
Figure 2: Surface of the Squared Euclidean Distance function (Eq. \ref{eq:sed}) over the axes $\Delta Y_{lon}$ and $\Delta Y_{lat}$, the error distances along the longitude and latitude axes; the upper horizontal gray surface indicates the empirical maximum of $L_{spat}$; the red line indicates the strict minimum of 0 implied by the nature of Eq. \ref{eq:sed}.
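A minimal sketch of the spatial loss plotted in this figure, assuming it is the plain sum of squared per-axis differences between predicted and true coordinates. The function name and the (lon, lat) tuple layout are hypothetical, and any scaling or batch averaging used by the authors is not shown.

```python
def squared_euclidean_distance(pred, target):
    """Sum of squared longitude and latitude differences between two (lon, lat) pairs."""
    d_lon = pred[0] - target[0]  # corresponds to Delta Y_lon in the caption
    d_lat = pred[1] - target[1]  # corresponds to Delta Y_lat in the caption
    return d_lon ** 2 + d_lat ** 2
```

Because both terms are squares, the value can never fall below 0, which matches the strict minimum marked in the figure.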
Figure 3: Surface of the Negative Log-LikeliHood function (Eq. \ref{eq:nllh}) over the axes $D^2$, the error distance, and $\sigma_{\widehat{c}}$, the uncertainty in $\widehat{\mu}$ of the Gaussian; the red lines and gray surfaces indicate the reduction of the $L_{prob}$ domain that results from applying Eq. \ref{eq:lbsp}.
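The effect described in this caption can be sketched as follows. Two assumptions are made and are not confirmed by the text here: the Gaussian is spherical with covariance $\sigma I$, so $\sigma$ is treated as a shared variance, and the lower-bounded SoftPlus simply adds a constant floor to the standard SoftPlus. The floor value passed as `lower_bound` is a hypothetical parameter.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def lower_bounded_softplus(raw, lower_bound):
    """Map an unconstrained value to a variance that never drops below lower_bound (assumed form)."""
    return softplus(raw) + lower_bound


def gaussian_nll(d2, sigma):
    """Negative log-likelihood of a 2D Gaussian with covariance sigma * I.

    d2 is the squared distance between the true point and the predicted mean.
    """
    return math.log(2 * math.pi) + math.log(sigma) + d2 / (2 * sigma)
```

Without a floor, a model that predicts the mean exactly could push $\sigma$ toward 0 and drive the loss toward negative infinity; with the floor, the loss is bounded below by `log(2*pi) + log(lower_bound)`, which is one way to read the reduced domain shown in the figure.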
Figure 4: Computational graph of the Single Outcome Prediction (SOP) model loss functions, including visualization of the Squared Euclidean Distance (Eq. \ref{eq:sed}), Lower-Bounded SoftPlus (Eq. \ref{eq:lbsp}), and Negative Log-LikeliHood (Eq. \ref{eq:nllh}) components.
Figure 5: Prediction examples of two models, which were trained on the same worldwide dataset with key NON-GEO and minor GEO-ONLY text features, for the same text: 'CIA and FBI can track anyone, and you willingly give the data away' with 5 outcomes sorted by significance (weight); above: Geospatial GMOP points as scatter plots; below: Probabilistic PMOP Gaussian peaks as LLH and PDF plots.
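To make the idea of weighted outcomes concrete, the sketch below ranks mixture components by weight and evaluates the log-likelihood of a point under the mixture. It assumes that weights come from a softmax over raw scores and that each component is a spherical 2D Gaussian with covariance $\sigma I$; neither detail is stated in the caption, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to weights that sum to 1 (max-shifted for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_outcomes(weight_logits, means, sigmas):
    """Return (weight, (lon, lat), sigma) triples sorted by descending weight."""
    weights = softmax(weight_logits)
    return sorted(zip(weights, means, sigmas), key=lambda o: o[0], reverse=True)


def mixture_log_likelihood(point, outcomes):
    """Log-density of a (lon, lat) point under the weighted Gaussian mixture."""
    terms = []
    for weight, (mu_lon, mu_lat), sigma in outcomes:
        if weight == 0.0:
            continue  # an underflowed weight contributes nothing
        d2 = (point[0] - mu_lon) ** 2 + (point[1] - mu_lat) ** 2
        log_pdf = -math.log(2 * math.pi) - math.log(sigma) - d2 / (2 * sigma)
        terms.append(math.log(weight) + log_pdf)
    m = max(terms)
    # Log-sum-exp keeps the sum stable when component densities are tiny.
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

Under these assumptions, the mean of the first ranked outcome can serve as a single point estimate, while the full list describes the probabilistic location distribution.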
...and 2 more figures
