Vocabulary embeddings organize linguistic structure early in language model training

Isabel Papadimitriou, Jacob Prince

TL;DR

This study investigates how the vocabulary embeddings of large language models come to encode linguistic structure during training. Applying representational similarity analysis (RSA) to two open-source models across training checkpoints, it documents an early emergence of semantic structure and early peaks in syntactic organization, with word frequency shaping longer-term geometry. High-frequency and function words converge rapidly, while low-frequency tokens retain biases from their random initializations; over training, distances in embedding space gradually align with frequency rank. Together, these results reveal distinct roles for word frequency and function in embedding dynamics. The findings offer a mechanistic view of how lexical representations bootstrap linguistic structure and suggest avenues for targeted interpretability and training-efficiency improvements.

Abstract

Large language models (LLMs) work by manipulating the geometry of input embedding vectors over multiple layers. Here, we ask: how are the input vocabulary representations of language models structured, and how and when does this structure evolve over training? To answer this question, we use representational similarity analysis, running a suite of experiments that correlate the geometric structure of the input embeddings and output embeddings of two open-source models (Pythia 12B and OLMo 7B) with semantic, syntactic, and frequency-based metrics over the course of training. Our key findings are as follows: 1) During training, the vocabulary embedding geometry quickly converges to high correlations with a suite of semantic and syntactic features; 2) Embeddings of high-frequency and function words (e.g., "the," "of") converge to their final vectors faster than lexical and low-frequency words, which retain some alignment with the bias in their random initializations. These findings help map the dynamic trajectory by which input embeddings organize around linguistic structure, revealing distinct roles for word frequency and function. Our findings motivate a deeper study of how the evolution of vocabulary geometry may facilitate specific capability gains during model training.
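As a rough illustration of the representational similarity analysis described above, the sketch below builds a representational dissimilarity matrix (RDM) from an embedding matrix and rank-correlates it with a hypothesis RDM. The cosine distance, the Spearman correlation, the toy data, and all function names (`cosine_rdm`, `rsa_score`) are assumptions chosen for illustration; the abstract does not specify the distance or correlation measures the authors use.

```python
import numpy as np


def cosine_rdm(embeddings):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return 1.0 - unit @ unit.T


def upper_triangle(rdm):
    """Flatten the strict upper triangle, so each word pair is counted once."""
    rows, cols = np.triu_indices(rdm.shape[0], k=1)
    return rdm[rows, cols]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    tie_sums = np.bincount(inverse, weights=ranks)
    return (tie_sums / counts)[inverse]


def rsa_score(model_rdm, hypothesis_rdm):
    """Spearman correlation between the off-diagonal entries of two RDMs."""
    a = average_ranks(upper_triangle(model_rdm))
    b = average_ranks(upper_triangle(hypothesis_rdm))
    return float(np.corrcoef(a, b)[0, 1])


# Toy data (hypothetical): six "words" in two categories of three.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 1])

# Hypothesis RDM: distance 0 within a category, 1 across categories.
hypothesis = (labels[:, None] != labels[None, :]).astype(float)

# Embeddings clustered by category, plus a little noise.
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
embeddings = centers[labels] + 0.01 * rng.normal(size=(6, 3))

score = rsa_score(cosine_rdm(embeddings), hypothesis)
print(f"RSA score against the category hypothesis: {score:.3f}")
```

Tracking structure over training would then amount to recomputing `rsa_score` on the embedding matrix of each checkpoint and plotting the scores against training step.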

Paper Structure

This paper contains 39 sections, 13 figures.

Figures (13)

  • Figure 1: A schematic illustrating our two uses of RSA. In Hypothesis RSA, we take the vocabulary matrix, correlating models to annotated hypotheses, and tracking the convergence of different classes of words
  • Figure 2: Experiment 1: correlation with semantic and syntactic similarity measures. We compare the distance relationships in the model vocabulary embedding with the distances in different measures of semantic and syntactic similarity. (b) Model embeddings come to represent semantic similarities quickly, with correlations converging quite early in training. (c) Model embeddings correlate with syntactic structural RDMs early in training, peaking then plateauing. Note that the y-axes differ across the two plots: each syntactic hypothesis captures a relatively simple relation compared to the more complex semantic relationships, which likely explains the lower overall correlation plateaus.
  • Figure 3: Experiment 2: The effect of frequency on the vocabulary. (a) Convergence of different frequency buckets (left): Frequent words (blue) converge to their final representations faster than infrequent words (orange; see inset). Less frequent words have correlated representational structures between their random initializations and their final checkpoints (right). The same figure, but with the x-axis rescaled independently for each line to reflect the expected number of times the model has seen the words in each frequency bucket, showing that frequent words evolve much more slowly per update. (b) Vocabulary embedding RDM correlations with frequency hypothesis distance matrices. During training, distances in vocabulary space gradually align with differences in frequency rank, though the relationship to raw frequency counts is non-monotonic.
  • Figure 4: Experiment 3: a series of analyses of what changes after linguistic feature stabilization. (a) Average distance between 1,000 random token embeddings and their embeddings at the final checkpoint. (b) Correlation between embeddings and unembeddings, by frequency bucket. (c) Top 10 words in Pythia that get closer between 20K and 142K steps. Full results in \ref{app:qualitative_furthest}, \ref{fig:qualitative_closer_full} & \ref{fig:qualitative_farther}.
  • Figure 5: OLMo results for Experiment 1
  • ...and 8 more figures
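
The Figure 4(a) caption above describes an average distance between sampled token embeddings and their embeddings at the final checkpoint. A minimal sketch of that kind of measurement follows, assuming cosine distance and a toy trajectory that linearly interpolates from a random initialization to a random final matrix; the function name `mean_distance_to_final` and all data here are hypothetical and stand in for real model checkpoints.

```python
import numpy as np


def mean_distance_to_final(checkpoints, final, token_ids):
    """Mean cosine distance between sampled tokens at each checkpoint
    and the same tokens at the final checkpoint."""
    final_sel = final[token_ids]
    final_unit = final_sel / np.linalg.norm(final_sel, axis=1, keepdims=True)
    curve = []
    for emb in checkpoints:
        sel = emb[token_ids]
        unit = sel / np.linalg.norm(sel, axis=1, keepdims=True)
        cosine = np.sum(unit * final_unit, axis=1)  # per-token cosine similarity
        curve.append(float(np.mean(1.0 - cosine)))
    return curve


# Toy trajectory (an assumption, not real training dynamics).
rng = np.random.default_rng(0)
vocab_size, dim = 200, 16
init = rng.normal(size=(vocab_size, dim))
final = rng.normal(size=(vocab_size, dim))
fractions = [0.0, 0.25, 0.5, 1.0]
checkpoints = [(1.0 - t) * init + t * final for t in fractions]

# Sample a subset of tokens, analogous to sampling random tokens from the vocabulary.
sample = rng.choice(vocab_size, size=50, replace=False)
curve = mean_distance_to_final(checkpoints, final, sample)
for t, d in zip(fractions, curve):
    print(f"fraction of training {t:.2f}: mean cosine distance {d:.3f}")
```

Passing the token ids of a single frequency bucket as `token_ids` would give one curve per bucket, which is one way to compare how quickly frequent and infrequent words approach their final vectors.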