Leveraging Large Language Models to Geolocate Linguistic Variations in Social Media Posts

Davide Savarro, Davide Zago, Stefano Zoia

TL;DR

The paper tackles the geolocation of non-standard Italian social media posts, predicting both the region and the precise coordinates from text alone. It jointly fine-tunes three Italian LLMs (Camoscio-7B, ANITA-8B, Minerva-3B) using ExtremITA-style prompts, so that subtask A (region) and subtask B (coordinates) are answered in a single generation. ANITA-8B achieves the strongest macro-F1 among the tested models (≈0.541) and the lowest mean distance (≈103 km), with Minerva-3B and Camoscio-7B trailing; this approaches, but remains below, the best 2023 result. The work demonstrates the viability of LLM-based geolocalization for sociolinguistic analysis and highlights avenues for improvement through preprocessing, data augmentation, and imbalance handling.
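
As a rough illustration of what answering both subtasks "in a single generation" could look like, the sketch below serializes a region label and a coordinate pair into one target string and parses it back. The tag names `[regione]` and `[geo]`, the helper names `build_target` and `parse_generation`, and the four-decimal formatting are assumptions made for illustration only; the paper's actual ExtremITA-style prompt format is not reproduced here.

```python
import re
from typing import Optional, Tuple

# Hypothetical tags; the real prompt/target format used in the paper may differ.
REGION_TAG = "[regione]"
GEO_TAG = "[geo]"


def build_target(region: str, lat: float, lon: float) -> str:
    """Serialize both subtask answers into a single target string."""
    return f"{REGION_TAG} {region} {GEO_TAG} {lat:.4f} {lon:.4f}"


# Region text is matched lazily so multi-word names (e.g. "Emilia Romagna") survive.
_PATTERN = re.compile(
    re.escape(REGION_TAG)
    + r"\s*(?P<region>.+?)\s*"
    + re.escape(GEO_TAG)
    + r"\s*(?P<lat>-?\d+(?:\.\d+)?)\s+(?P<lon>-?\d+(?:\.\d+)?)"
)


def parse_generation(text: str) -> Optional[Tuple[str, float, float]]:
    """Recover (region, lat, lon) from generated text, or None if malformed."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    return match["region"], float(match["lat"]), float(match["lon"])


example = build_target("Lazio", 41.9028, 12.4964)
print(example)
print(parse_generation(example))
```

Returning `None` for malformed generations keeps the decision about fallbacks (for instance, a default region or centroid) outside the parser, since the source does not say how such cases are handled.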

Abstract

Geolocalization of social media content is the task of determining the geographical location of a user based on textual data, which may show linguistic variations and informal language. In this project, we address the GeoLingIt challenge of geolocalizing tweets written in Italian by leveraging large language models (LLMs). GeoLingIt requires the prediction of both the region and the precise coordinates of the tweet. Our approach involves fine-tuning pre-trained LLMs to simultaneously predict these geolocalization aspects. By integrating innovative methodologies, we enhance the models' ability to understand the nuances of Italian social media text to improve the state-of-the-art in this domain. This work is conducted as part of the Large Language Models course at the Bertinoro International Spring School 2024. We make our code publicly available on GitHub: https://github.com/dawoz/geolingit-biss2024.
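
The two scores quoted in the TL;DR, macro-F1 for the region and mean distance in km for the coordinates, can be sketched as follows. This is a minimal illustration, not the official scorer: it assumes macro-F1 is averaged over every class seen in either the gold or the predicted labels, and that distances are great-circle (haversine) distances on a sphere with the mean Earth radius. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def mean_distance_km(true_pts: Sequence[Point], pred_pts: Sequence[Point]) -> float:
    """Average haversine error over paired gold and predicted coordinates."""
    return sum(haversine_km(t, p) for t, p in zip(true_pts, pred_pts)) / len(true_pts)


rome, milan = (41.9028, 12.4964), (45.4642, 9.1900)
print(macro_f1(["Lazio", "Lazio", "Veneto"], ["Lazio", "Veneto", "Veneto"]))
print(mean_distance_km([rome], [milan]))
```

Because macro-F1 weights every region equally, rare regions influence the score as much as frequent ones, which is why the class imbalance mentioned in the TL;DR matters for this metric.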

Paper Structure (10 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Bar chart representing the number of posts for each label (region of provenance). The labels (x-axis) are sorted by decreasing frequency (y-axis).
  • Figure 2: Geographical distribution of the social media posts in the GeoLingIt dataset.
  • Figure 3: Confusion matrices for the classification of the test-set samples for all tested models: Camoscio, ANITA and Minerva. The classes on the x and y axes include only the classes present in the test set, which are a subset of the Italian regions. The number in each cell $(c_{pred},c_{true})$ is the frequency of samples with true class $c_{true}$ classified as $c_{pred}$, normalized by the total number of samples of the true class (row). Cells containing "-" indicate zero frequency. Darker colors highlight higher frequencies, so a darker main diagonal implies strong classification performance.
  • Figure 4: Heatmaps of the regression error (in km) over Italian provinces for all tested models. The figures in the left column show the sum of the distance error over the area of each province, while the figures in the right column show the average distance error over the same areas.
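
The aggregations described in the captions of Figures 3 and 4 can be sketched in a few lines. This is a hedged illustration rather than the authors' plotting code: the function names are hypothetical, predictions falling outside the test-set labels are assumed to be left out of the displayed columns (so a row may sum to less than 1), and each error is assumed to be assigned to a province by an identifier supplied alongside it.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def row_normalized_confusion(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Tuple[List[str], List[List[float]]]:
    """Confusion matrix with rows = true class, each row divided by its class size."""
    labels = sorted(set(y_true))  # only classes present in the test set
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    totals = [0] * len(labels)
    for t, p in zip(y_true, y_pred):
        totals[index[t]] += 1
        if p in index:  # assumption: out-of-set predictions get no column
            counts[index[t]][index[p]] += 1
    matrix = [[c / totals[i] for c in row] for i, row in enumerate(counts)]
    return labels, matrix


def aggregate_errors(records: Sequence[Tuple[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Map each province to (sum of errors in km, mean error in km)."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for province, error_km in records:
        buckets[province].append(error_km)
    return {p: (sum(v), sum(v) / len(v)) for p, v in buckets.items()}


labels, matrix = row_normalized_confusion(
    ["Lazio", "Lazio", "Veneto"], ["Lazio", "Veneto", "Veneto"]
)
print(labels, matrix)
print(aggregate_errors([("RM", 10.0), ("RM", 30.0), ("MI", 5.0)]))
```

The summed error grows with the number of posts in a province, while the mean does not, so the two columns of Figure 4 answer different questions: where most of the total error accumulates versus where individual predictions are least accurate.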