Predicting the Geolocation of Tweets Using transformer models on Customized Data
Kateryna Lutsai, Christoph H. Lampert
TL;DR
The paper tackles tweet geolocation by finetuning multilingual BERT models to output either direct coordinates or Gaussian Mixture Model parameters, enabling both point estimates and probabilistic location distributions. It introduces a multitask wrapper framework that uses text+user metadata as a Key Feature set and place metadata as a Minor Feature, trained with custom losses that combine geospatial distance and negative log-likelihood for GMMs. The proposed Probabilistic Multiple Outcomes Prediction (PMOP) approach with 5 outcomes achieves strong worldwide performance, while a lower-bound covariance constraint on the GMM helps stabilize training and improve metrics; ablation shows robust country-level results and language-driven effects. The method offers a flexible, scalable, and privacy-conscious route to geotagging large-scale textual data, with a practical plug-and-play setup and potential extensions to other base models and geographic granularities.
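The GMM negative log-likelihood with a lower-bounded covariance can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes isotropic (spherical) components, treats (longitude, latitude) as a flat Euclidean plane, and uses hypothetical names (`gmm_nll`, `raw_weights`, `raw_vars`, `var_floor`) that do not come from the paper.

```python
import math

def gmm_nll(target, means, raw_weights, raw_vars, var_floor=1e-2):
    """Negative log-likelihood of a 2-D point under an isotropic GMM.

    Assumptions (not from the paper): softmax mixture weights, softplus
    variances, and a hypothetical lower bound `var_floor` on each variance.
    """
    # Log-softmax turns unconstrained scores into log mixture weights.
    top_w = max(raw_weights)
    log_total = math.log(sum(math.exp(w - top_w) for w in raw_weights))

    log_terms = []
    for (mx, my), w, rv in zip(means, raw_weights, raw_vars):
        log_weight = (w - top_w) - log_total
        # Softplus keeps the variance positive; the floor bounds it from below.
        var = var_floor + math.log1p(math.exp(rv))
        sq_dist = (target[0] - mx) ** 2 + (target[1] - my) ** 2
        # Log-density of a 2-D isotropic Gaussian with variance `var`.
        log_pdf = -math.log(2.0 * math.pi * var) - sq_dist / (2.0 * var)
        log_terms.append(log_weight + log_pdf)

    # Log-sum-exp over components for numerical stability.
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))
```

In this sketch the floor prevents a component from collapsing onto a single training point, where the variance would shrink toward zero and the likelihood would grow without bound.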
Abstract
This research aims to solve the tweet/user geolocation prediction task and to provide a flexible methodology for geotagging textual big data. The suggested approach uses neural networks for natural language processing (NLP) to estimate the location either as a coordinate pair (longitude, latitude) or as a two-dimensional Gaussian Mixture Model (GMM). The proposed models were finetuned on a Twitter dataset, using pretrained Bidirectional Encoder Representations from Transformers (BERT) as base models. Performance metrics show a median error of less than 30 km on the worldwide-level dataset and less than 15 km on the US-level dataset for models trained and evaluated on text features drawn from tweet content and metadata context. Our source code and data are available at https://github.com/K4TEL/geo-twitter.git
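A median error in kilometres can be computed along the following lines. This is a sketch under stated assumptions: the abstract does not say which distance is used, so great-circle (haversine) distance on a sphere with mean Earth radius is assumed here, and the function names `haversine_km` and `median_error_km` are illustrative only.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def median_error_km(predicted, actual):
    """Median distance between predicted and true (longitude, latitude) pairs."""
    return median(haversine_km(*p, *a) for p, a in zip(predicted, actual))
```

For example, `median_error_km([(14.42, 50.09)], [(16.37, 48.21)])` returns the distance between a point predicted in Prague and a true location in Vienna; over a full test set the same call returns the median of all per-tweet errors.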
