On the Scaling Laws of Geographical Representation in Language Models

Nathan Godey, Éric de la Clergerie, Benoît Sagot

TL;DR

This work investigates how the geographical knowledge embedded in the hidden representations of language models evolves with model scale across diverse architectures. A linear ridge probe maps the hidden representations of prompts to geographic coordinates using the World dataset, with $R^2$ reported as the performance metric. The results show that geographical signals exist even in tiny models and improve with scale. Crucially, larger models exhibit stronger geographical bias tied to the pretraining data: coordinate accuracy correlates with country-name frequency, while population counts show little relation. The findings imply that scaling up LLMs can amplify data-driven geographical biases, underscoring the need for data-centric bias mitigation alongside careful consideration of pretraining corpora.
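The probing setup summarized above can be illustrated with a minimal sketch. It is not the authors' code: the hidden states are replaced by synthetic random vectors, and the names `fit_ridge_probe`, `r2_score`, and the regularization strength `alpha` are assumptions made for illustration. It only shows how a linear ridge probe can map representations to two coordinates and how a test $R^2$ can be computed.

```python
import numpy as np

def fit_ridge_probe(H, Y, alpha=1.0):
    """Closed-form ridge regression from hidden states H (n, d) to targets Y (n, 2).

    The data are centered so that the intercept is not penalized.
    `alpha` is a hypothetical regularization strength, not a value from the paper.
    """
    h_mean, y_mean = H.mean(axis=0), Y.mean(axis=0)
    Hc, Yc = H - h_mean, Y - y_mean
    d = H.shape[1]
    # Solve (Hc^T Hc + alpha * I) W = Hc^T Yc
    W = np.linalg.solve(Hc.T @ Hc + alpha * np.eye(d), Hc.T @ Yc)
    b = y_mean - h_mean @ W
    return W, b

def r2_score(Y_true, Y_pred):
    """Coefficient of determination, averaged over the output dimensions."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))

# Synthetic stand-in for (hidden state, coordinate) pairs; real experiments
# would use model activations and true latitude/longitude values instead.
rng = np.random.default_rng(0)
n, d = 500, 32
H = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, 2))
Y = H @ W_true + 0.1 * rng.normal(size=(n, 2))

H_train, H_test = H[:400], H[400:]
Y_train, Y_test = Y[:400], Y[400:]

W, b = fit_ridge_probe(H_train, Y_train, alpha=1.0)
test_r2 = r2_score(Y_test, H_test @ W + b)
print(f"test R^2 = {test_r2:.3f}")
```

Because the synthetic targets are linear in the inputs by construction, the probe fits them almost perfectly; with real activations, the test $R^2$ is the quantity that would be compared across model sizes.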

Abstract

Language models have long been shown to embed geographical information in their hidden representations. This line of work has recently been revisited by extending this result to Large Language Models (LLMs). In this paper, we propose to fill the gap between well-established and recent literature by observing how geographical knowledge evolves when scaling language models. We show that geographical knowledge is observable even for tiny models, and that it scales consistently as we increase the model size. Notably, we observe that larger language models cannot mitigate the geographical bias that is inherent to the training data.

Paper Structure (8 sections, 1 equation, 6 figures)

Figures (6)

  • Figure 1: Predicted coordinates of test set instances for different model sizes. Each color represents a different continent.
  • Figure 2: Evolution of the $R^2$ coefficient on the test set for various model suites.
  • Figure 3: Average MSE by continent for different sizes in the Pythia suite.
  • Figure 4: Gini coefficients of MSE on the test set averaged by country or by continent, as model size increases.
  • Figure 5: Test log-MSE for Pythia-1B as plotted on a World map.
  • ...and 1 more figure
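Figure 4 refers to Gini coefficients of MSE averaged by country or by continent. As a rough sketch of how such an inequality measure could be computed, the snippet below averages per-example squared errors within each group and then applies the standard Gini formula to the group means. The helper names `mean_by_group` and `gini` and the toy error values are hypothetical, and the exact aggregation used in the paper is an assumption here.

```python
import numpy as np

def mean_by_group(errors, groups):
    """Average the per-example errors within each group label (e.g. country)."""
    sums, counts = {}, {}
    for err, grp in zip(errors, groups):
        sums[grp] = sums.get(grp, 0.0) + float(err)
        counts[grp] = counts.get(grp, 0) + 1
    return {grp: sums[grp] / counts[grp] for grp in sums}

def gini(values):
    """Gini coefficient of non-negative values: 0 means all values are equal."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2.0 * (ranks * x).sum() / (n * x.sum()) - (n + 1) / n)

# Toy per-example squared errors with made-up group labels.
errors = [0.1, 0.3, 2.0, 2.4, 0.2, 5.0]
groups = ["A", "A", "B", "B", "C", "D"]

group_mse = mean_by_group(errors, groups)
inequality = gini(list(group_mse.values()))
print(group_mse, inequality)
```

A value near 0 would indicate that the error is spread evenly across groups, while a value closer to 1 would indicate that a few groups carry most of the error.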