Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility

Michael A. Lepori, Jennifer Hu, Ishita Dasgupta, Roma Patel, Thomas Serre, Ellie Pavlick

Abstract

Language models (LMs) are used for a diverse range of tasks, from question answering to writing fantastical stories. To accomplish these tasks reliably, LMs must be able to discern the modal category of a sentence (i.e., whether it describes something that is possible, impossible, completely nonsensical, etc.). However, recent studies have called into question the ability of LMs to categorize sentences according to modality (Michaelov et al., 2025; Kauf et al., 2023). In this work, we identify linear representations within a variety of LMs that discriminate between modal categories, which we call modal difference vectors. Analysis of modal difference vectors reveals that LMs have access to more reliable modal categorization judgments than previously reported. Furthermore, we find that modal difference vectors emerge in a consistent order as models become more competent (i.e., through training steps, layers, and parameter count). Notably, we find that modal difference vectors identified within LM activations can be used to model fine-grained human categorization behavior. This potentially provides a novel view into how human participants distinguish between modal categories, which we explore by correlating projections along modal difference vectors with human participants' ratings of interpretable features. In summary, we derive new insights into LM modal categorization using techniques from mechanistic interpretability, with the potential to inform our understanding of modal categorization in humans.

Paper Structure

This paper contains 32 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: (Left) Diagram describing how we create modal difference vectors. In this example, a modal difference vector capturing the difference between probable and impossible stimuli is created by taking the mean over differences in hidden representations. (Right) Diagram describing how modal difference vectors are used to classify novel minimal pairs of impossible/probable sentences. Hidden representations from each sentence are projected onto the modal difference vector, and the magnitudes of these projections are compared.
  • Figure 2: Classification evaluations for models with at least 2B parameters. Results are averages across models and generalization datasets. Modal difference vectors outperform probability estimates and other projection-based classification baselines.
  • Figure 3: (a) Average generalization performance vs. parameter count reveals a large performance gap between models with fewer than 2B parameters and models with more. (b) At smaller scales, (c) earlier in training, and (d) in earlier layers, models form modal difference vectors that can differentiate inconceivable stimuli from other modal categories. After that, models learn the distinction between probable and impossible, then probable and improbable, and finally improbable and impossible.
  • Figure 4: (Left) A qualitative example of stimuli from Hu et al. (2025) projected along two modal difference vectors. Dots are colored according to their expert label. Background color intensity represents the probability that each point belongs to a particular class according to a logistic regression model fit to this subset of data using these two features. (Right) (a) Pearson correlation between the predicted probability distributions and the empirical proportion of participants who selected each category. (b) Mean squared error between predicted and empirical response distributions. (c) Pearson correlation between the entropy of predicted and empirical response distributions. In all analyses, we find that featurizing using projections along modal difference vectors leads to better models of human categorization behavior.
  • Figure 5: Absolute correlations between projections along modal difference vectors and interpretable features (averaged over models). Notably, Probable-Improbable correlates with human subjective event likelihood judgments, and Impossible-Inconceivable correlates selectively with imageability, the presence of physical objects, and places/environments.
  • ...and 3 more figures
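
The procedure described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the synthetic "hidden representations", and the choice to compare signed scalar projections (rather than some other notion of projection magnitude) are all assumptions made for illustration. The sketch builds a difference vector as the mean of pairwise differences, then classifies held-out minimal pairs by comparing their projections onto that vector.

```python
import numpy as np


def modal_difference_vector(h_a, h_b):
    """Mean over pairwise differences of hidden representations.

    h_a, h_b: arrays of shape (n_pairs, d); row i of each forms one minimal pair
    (e.g. h_a = probable sentences, h_b = impossible sentences).
    """
    h_a = np.asarray(h_a, dtype=float)
    h_b = np.asarray(h_b, dtype=float)
    return (h_a - h_b).mean(axis=0)


def classify_pair(h_1, h_2, v):
    """Return 0 if h_1 projects further along v than h_2, otherwise 1.

    Assumption: the comparison uses the signed dot product with v.
    """
    return 0 if np.dot(h_1, v) > np.dot(h_2, v) else 1


def pair_accuracy(h_a, h_b, v):
    """Fraction of pairs in which the category-A member projects further along v."""
    h_a = np.asarray(h_a, dtype=float)
    h_b = np.asarray(h_b, dtype=float)
    return float(np.mean(h_a @ v > h_b @ v))


# Synthetic stand-in for LM activations (hypothetical data, not from the paper):
# the two categories are offset along a single hidden direction plus Gaussian noise.
rng = np.random.default_rng(0)
d, n_train, n_test = 16, 50, 50
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)


def sample(n, sign):
    # Draw n noisy vectors shifted by sign * 3 along the hidden direction.
    return rng.normal(size=(n, d)) + sign * 3.0 * direction


train_a, train_b = sample(n_train, +1), sample(n_train, -1)
test_a, test_b = sample(n_test, +1), sample(n_test, -1)

v = modal_difference_vector(train_a, train_b)
acc = pair_accuracy(test_a, test_b, v)
print(f"held-out pair accuracy: {acc:.2f}")
```

With real models, the rows of `h_a` and `h_b` would be hidden states extracted at a chosen layer and token position for each sentence in a minimal pair; which layer and position to use is left open here because the captions do not specify it.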