Words That Make Language Models Perceive

Sophie L. Wang, Phillip Isola, Brian Cheung

TL;DR

This work investigates whether a text-only LLM can surface perceptual grounding through simple sensory prompts. By defining generative representations from autoregressive continuations and comparing their kernel-based geometry to unimodal vision and audio encoders, the authors show that prompting a model to 'see' or 'hear' can align its internal representations with modality-specific encoders, with alignment improving as generation length and model size increase. They demonstrate causal effects via sensory-word ablations and control for hallucinations, and extend the analysis to a text-space VQA task to confirm functional benefits. The findings suggest that perceptual grounding need not be rooted in multimodal training; inference-time prompts can steer purely textual models toward multimodal-like representations, offering practical paths for cross-modal retrieval, evaluation, and distillation while blurring the line between unimodal and multimodal systems.

Abstract

Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to 'see' or 'hear', it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.
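
As a rough illustration of what "representational alignment" between a text-only LLM and a specialist encoder can mean in practice, the sketch below compares the nearest-neighbor structure of two feature sets computed over the same paired items. It is a minimal sketch under stated assumptions, not the authors' implementation: the cosine kernel, the mutual k-nearest-neighbor overlap score, and the function names (`cosine_kernel`, `topk_neighbors`, `mutual_knn_alignment`) are choices made here for illustration, and the random arrays stand in for real LLM and encoder features. The value k=10 mirrors the shared top-k neighbors mentioned in the Figure 5 caption.

```python
import numpy as np


def cosine_kernel(features):
    """Pairwise cosine similarity between rows of an (n_samples, dim) array."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def topk_neighbors(kernel, k):
    """Indices of each item's k most similar other items (self excluded)."""
    sim = kernel.copy()
    np.fill_diagonal(sim, -np.inf)  # an item is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Mean fraction of top-k neighbors shared by two representations.

    Row i of feats_a and row i of feats_b must describe the same item
    (for example, a caption and its paired image). Assumed metric, see lead-in.
    """
    nn_a = topk_neighbors(cosine_kernel(feats_a), k)
    nn_b = topk_neighbors(cosine_kernel(feats_b), k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))


# Synthetic stand-ins (assumption): two views of one shared latent structure,
# plus an unrelated set of features for contrast.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 16))
text_feats = latent @ rng.normal(size=(16, 64))    # placeholder for LLM features
vision_feats = latent @ rng.normal(size=(16, 32))  # placeholder for encoder features
unrelated = rng.normal(size=(200, 32))

print("shared structure:", mutual_knn_alignment(text_feats, vision_feats))
print("unrelated:       ", mutual_knn_alignment(text_feats, unrelated))
```

In a setting like the one the abstract describes, `feats_a` would come from the LLM with and without a sensory cue, `feats_b` from the vision or audio encoder, and the two resulting scores would be compared.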

Paper Structure

This paper contains 42 sections, 3 equations, 37 figures, 2 tables.

Figures (37)

  • Figure 1: A cue that asks the model to 'see' (or 'hear') the provided text description moves the kernel representation of the model closer to the specialist model given the image (or audio) modality.
  • Figure 2: Generative representations (no sensory cue from Figure \ref{fig:sensory_prompting_trend_b}, 128-token generations) yield higher alignment than single-pass embeddings.
  • Figure 3: Sensory cues induce a generative text-only LLM representation that has higher alignment with the corresponding encoder. The star denotes matching cue-modality in generative representation.
  • Figure 4: Snippets of text generated from the caption, under sensory cues. We highlight, by hand, words that may be associated with the sensory modality. The full example is in Appendix \ref{app:128-token}.
  • Figure 5: Selected examples where visual prompting yields the largest increase in shared top-$k=10$ neighbors with the vision encoder (vs. no cue). Blue outlines mark inputs also among the vision encoder’s nearest neighbors. Additional generations and examples appear in Appendix \ref{app:examples}.
  • ...and 32 more figures