Modeling the human lexicon under temperature variations: linguistic factors, diversity and typicality in LLM word associations

Maria Andueza Rodriguez, Marie Candito, Richard Huyghe

Abstract

Large language models (LLMs) generate impressively fluent text, yet the nature of their linguistic knowledge, in particular the human-likeness of their internal lexicon, remains uncertain. This study compares human and LLM-generated word associations to evaluate how accurately models capture human lexical patterns. Using English cue-response pairs from the SWOW dataset and newly generated associations from three LLMs (Mistral-7B, Llama-3.1-8B, and Qwen-2.5-32B) across multiple temperature settings, we examine (i) the influence of lexical factors such as word frequency and concreteness on cue-response pairs, and (ii) the variability and typicality of LLM responses compared to human responses. Results show that all models mirror human trends for frequency and concreteness but differ in response variability and typicality. The larger model, Qwen, tends to emulate a single "prototypical" human participant, generating highly typical but minimally variable responses, while the smaller models, Mistral and Llama, produce more variable yet less typical responses. Temperature further shapes this trade-off: higher values increase variability but decrease typicality. These findings highlight both the similarities and the differences between human and LLM lexicons, and emphasize the need to account for model size and temperature when probing LLM lexical representations.
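The effect of temperature on variability can be illustrated with the standard temperature-scaled softmax used in sampling. The following is a minimal sketch, not the authors' code: the logits are invented for illustration, and the function names `softmax_with_temperature` and `entropy` are hypothetical. It shows only the general mechanism (dividing logits by the temperature before normalizing), under which a higher temperature flattens the distribution over candidate responses.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical scores for four candidate responses to one cue (assumption).
logits = [2.0, 1.0, 0.5, 0.0]

low = softmax_with_temperature(logits, 0.5)
high = softmax_with_temperature(logits, 1.5)

print("T=0.5:", [round(p, 3) for p in low], "entropy", round(entropy(low), 3))
print("T=1.5:", [round(p, 3) for p in high], "entropy", round(entropy(high), 3))
```

At the lower temperature the top-scoring response absorbs most of the probability mass, so repeated sampling returns it more often; at the higher temperature the mass spreads out, which is the direction of the variability increase described in the abstract.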

Paper Structure (21 sections, 3 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Distribution of log-transformed response/cue frequency ratios across respondents, considering responses provided by humans and LLMs at Rank 1, with LLM temperature set to 1.
  • Figure 2: Average relative frequency of R1 responses across cue frequency bins (LLM temperature = 1). The number of cues included in each bin is indicated below or above the bars.
  • Figure 3: Distribution of response/cue concreteness ratios across respondents, considering responses provided by humans and LLMs at Rank 1, with LLM temperature set to 1.
  • Figure 4: Average relative concreteness of R1 responses across cue concreteness bins (LLM temperature = 1). The number of cues included in each bin is indicated above the bars.
  • Figure 5: Distribution of variability ($\#R1$) and typicality ($tok\text{-}SS1$) in human and LLM datasets at temperature 1. Grey boxplots (right y-axis) represent variability, while blue boxplots (left y-axis) represent typicality.
  • ...and 7 more figures
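
The log-transformed response/cue frequency ratio named in the caption of Figure 1 can be sketched as follows. This is an illustrative assumption rather than the authors' implementation: the caption does not state the logarithm base or the frequency source, so the natural logarithm is used here, and the function name `log_frequency_ratio` and the example frequencies are hypothetical.

```python
import math


def log_frequency_ratio(response_freq, cue_freq):
    """Log of the response/cue frequency ratio.

    Positive values mean the response is more frequent than the cue,
    negative values mean it is less frequent, and 0 means equal frequency.
    The natural log is an assumption; the source does not give the base.
    """
    if response_freq <= 0 or cue_freq <= 0:
        raise ValueError("frequencies must be positive")
    return math.log(response_freq / cue_freq)


# Hypothetical corpus frequencies, invented for illustration only.
more_frequent_response = log_frequency_ratio(200.0, 50.0)
less_frequent_response = log_frequency_ratio(50.0, 200.0)

print(round(more_frequent_response, 3))
print(round(less_frequent_response, 3))
```

The log transform makes the measure symmetric around zero: swapping cue and response flips the sign without changing the magnitude.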