World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings

Elan Barenholtz

TL;DR

Ordinary word co-occurrence appears to preserve richer spatial, temporal, and environmental structure than is often assumed: simple static embeddings can retain world-shaped structure from text alone.

Abstract

Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for world-like internal representations. We test a simpler possibility: that much of the relevant structure is already latent in text itself. Applying the same class of ridge regression probes to static co-occurrence-based embeddings (GloVe and Word2Vec), we find substantial recoverable geographic signal and weaker but reliable temporal signal, with held-out $R^2$ values of 0.71-0.87 for city coordinates and 0.48-0.52 for historical birth years. Semantic-neighbor analyses and targeted subspace ablations show that these signals depend strongly on interpretable lexical gradients, especially country names and climate-related vocabulary. These findings suggest that ordinary word co-occurrence preserves richer spatial, temporal, and environmental structure than is often assumed, revealing a remarkable and underappreciated capacity of simple static embeddings to preserve world-shaped structure from text alone. Linear probe recoverability alone therefore does not establish a representational move beyond text.
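As a rough illustration of what a ridge regression probe with a held-out $R^2$ looks like, the sketch below fits a closed-form ridge model on synthetic vectors and scores it on held-out items. Everything in it is an assumption made for illustration: the function names `fit_ridge` and `r2_per_target`, the penalty `alpha=1.0`, the 150/50 split, and the random stand-in data. It does not use GloVe or Word2Vec and is not the paper's pipeline.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression; the intercept is left unpenalized."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc for the weight matrix W.
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def r2_per_target(Y_true, Y_pred):
    """Coefficient of determination, one value per target column."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data (NOT the paper's embeddings or cities).
rng = np.random.default_rng(0)
n_items, dim = 200, 50
X = rng.normal(size=(n_items, dim))                    # stand-in for word vectors
true_W = rng.normal(size=(dim, 2))
Y = X @ true_W + 0.5 * rng.normal(size=(n_items, 2))   # stand-in for (lat, lon)

# Hold out a random 50 items; fit the probe on the remaining 150.
perm = rng.permutation(n_items)
train_idx, test_idx = perm[:150], perm[150:]
W, b = fit_ridge(X[train_idx], Y[train_idx], alpha=1.0)
r2 = r2_per_target(Y[test_idx], X[test_idx] @ W + b)
print("held-out R^2 per target:", r2)
```

With real embeddings, `X` would hold one vector per city or person name and `Y` the corresponding coordinates or birth years. The score is computed only on items the probe never saw during fitting, which is what "held-out" means in the abstract.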

Paper Structure (29 sections, 1 equation, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Actual (open circles) versus predicted (filled triangles) city locations from ridge regression probes on GloVe and Word2Vec embeddings for all 99 cities. Lines connect actual and predicted positions for seven labeled cities to illustrate typical prediction error.
  • Figure 2: Actual (open circles, on the diagonal) versus predicted (filled triangles) birth year for 194 historical figures. Lines connect actual and predicted values for labeled figures. The probe captures era-level temporal structure from both GloVe and Word2Vec embeddings.
  • Figure 3: Data-driven identification of words whose GloVe embeddings correlate with city temperature. For each of the 17,000+ common English words passing our filters, we compute cosine similarity to all 86 city embeddings and correlate with mean annual temperature. Shown are the 15 most positively correlated (red, associated with warmer cities) and 15 most negatively correlated (blue, associated with colder cities). All correlations are significant at $p < 10^{-7}$ (* = $p < 0.05$). No words were selected a priori; the semantic categories emerge from the data.
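The procedure described in the Figure 3 caption (cosine similarity of each word to every city embedding, then correlation of those similarities with city temperature) can be sketched as follows. This is a minimal sketch on synthetic data: the function names, the planted "climate" direction, and the vector dimensions are all assumptions for illustration, and the word filters and significance tests mentioned in the caption are not reproduced here.

```python
import numpy as np


def cosine_sim_matrix(A, B):
    """Cosine similarity between every row of A and every row of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T


def temperature_correlations(word_vecs, city_vecs, temps):
    """Pearson r between each word's city similarities and city temperature."""
    sims = cosine_sim_matrix(word_vecs, city_vecs)      # (n_words, n_cities)
    sims_c = sims - sims.mean(axis=1, keepdims=True)
    t_c = temps - temps.mean()
    num = sims_c @ t_c
    den = np.linalg.norm(sims_c, axis=1) * np.linalg.norm(t_c)
    return num / den


# Synthetic stand-in data (NOT real embeddings or temperatures).
rng = np.random.default_rng(0)
n_cities, dim = 86, 20
temps = rng.uniform(-5.0, 30.0, size=n_cities)          # hypothetical mean annual temps
z = (temps - temps.mean()) / temps.std()

# Plant a hypothetical "climate" direction so warmer cities lean along it.
direction = np.zeros(dim)
direction[0] = 1.0
city_vecs = rng.normal(size=(n_cities, dim)) + 3.0 * z[:, None] * direction

# Three toy "words": aligned with the direction, opposed to it, and unrelated.
word_vecs = np.vstack([direction, -direction, rng.normal(size=dim)])
r = temperature_correlations(word_vecs, city_vecs, temps)

# Rank words from most negatively to most positively correlated.
order = np.argsort(r)
print("correlations:", r)
print("coldest-to-warmest word indices:", order)
```

On real data, `word_vecs` would hold the filtered vocabulary and the top and bottom entries of `order` would give the warm-associated and cold-associated word lists. Obtaining p-values would need an additional step, for example `scipy.stats.pearsonr` applied per word.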