The Zero Body Problem: Probing LLM Use of Sensory Language

Rebecca M. M. Hicke, Sil Hamilton, David Mimno

TL;DR

The paper investigates whether large language models (LLMs), which lack perception, can approximate human use of embodied sensory language. It expands a parallel human–model short-story corpus to $20{,}000$ texts across $19$ models and measures sensory language use along $12$ axes using established sensorimotor and concreteness lexicons. Gemini models use significantly more sensory language than humans along most axes, while most models from the other families use significantly less. Linear probes run on five models suggest that they can identify sensory language in their latent representations even though their actual usage diverges from human usage. The study also reports preliminary evidence that instruction tuning via RLHF may reduce sensory language, and it releases the expanded dataset for future work, with implications for empathy-driven or perceptual applications of AI.

Abstract

Sensory language expresses embodied experiences ranging from taste and sound to excitement and stomachache. This language is of interest to scholars from a wide range of domains including robotics, narratology, linguistics, and cognitive science. In this work, we explore whether language models, which are not embodied, can approximate human use of embodied language. We extend an existing corpus of parallel human and model responses to short story prompts with an additional 18,000 stories generated by 18 popular models. We find that all models generate stories that differ significantly from human usage of sensory language, but the direction of these differences varies considerably between model families. Namely, Gemini models use significantly more sensory language than humans along most axes whereas most models from the remaining five families use significantly less. Linear probes run on five models suggest that they are capable of identifying sensory language. However, we find preliminary evidence suggesting that instruction tuning may discourage usage of sensory language. Finally, to support further work, we release our expanded story dataset.

Paper Structure

This paper contains 11 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Comparison of model and human use of sensory language along each of the twelve axes for each model; values $<0$ indicate that the model uses more sensory language than humans, and values $>0$ that it uses less. The paired t-test statistics (left) and average differences (right) show significant differences between model and human language use for most models along most axes. The dotted vertical lines in the left panel mark significance at $\alpha=0.05$; any t-test statistic $\geq 1.96$ or $\leq -1.96$ represents a significant difference between model and human language use.
  • Figure 2: The average performance (F1) of 100 logistic regression classifiers in distinguishing between texts written by humans and texts written by each model. Each classifier is trained with a different 50:50 training and test split. The error bars represent the standard deviation over all 100 classifier runs and the dotted line marks expected random performance.
  • Figure 3: The average feature importance from each logistic regression model trained to distinguish between models and humans. Each dot represents an average over 100 models.
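
The significance test described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes each prompt has one human story and one model story, each already reduced to a single score on one sensory axis, and that differences are taken as human minus model (inferred from the caption's note that values $<0$ mean the model uses more sensory language). The function names and the scores are hypothetical and are not taken from the paper; this is not the authors' code.

```python
import math
import statistics


def paired_t_statistic(human_scores, model_scores):
    """Return (t, mean_diff) for per-prompt differences, human minus model.

    Assumption: both lists are aligned by prompt, one score per story.
    """
    if len(human_scores) != len(model_scores):
        raise ValueError("scores must be paired one-to-one by prompt")
    diffs = [h - m for h, m in zip(human_scores, model_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_diff / (sd / math.sqrt(n)), mean_diff


def is_significant(t, threshold=1.96):
    """Two-sided check at alpha = 0.05 using the large-sample cutoff 1.96."""
    return abs(t) >= threshold


# Hypothetical per-story scores on one axis, for illustration only.
human = [0.42, 0.38, 0.51, 0.47, 0.40, 0.45]
model = [0.50, 0.49, 0.55, 0.58, 0.47, 0.56]

t_stat, mean_diff = paired_t_statistic(human, model)
print(f"t = {t_stat:.2f}, mean difference = {mean_diff:.3f}")
print("significant:", is_significant(t_stat))
```

With these made-up scores the mean difference is negative, which under the caption's convention would mean the model uses more sensory language than the human authors. The fixed cutoff of 1.96 is a large-sample approximation; with only a handful of pairs, as in this toy example, the exact critical value from the t distribution would be larger.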