Language Models Don't Learn the Physical Manifestation of Language

Bruce W. Lee, JaeHyuk Lim

TL;DR

This work investigates whether language-only models can truly grasp the physical manifestation of language by introducing H-Test, a battery of visuospatial and auditory tasks. Across a range of language-only models, results cluster near random performance, suggesting strong sensory-grounding blind spots that such models do not readily bridge through scaling or in-context learning. Some multimodal systems (e.g., GPT-4o, Claude 3 Opus) show improved performance on parts of H-Test, indicating that sensory grounding or architectural differences may enable solving the tasks, though the exact mechanisms remain unclear. Grounded in the Mary’s Room analogy, the paper argues for the necessity of sensory experience or alternative architectures to achieve robust, human-like language understanding, and it highlights multiple limitations and future directions for grounding language models in perception.

Abstract

We argue that language-only models don't learn the physical manifestation of language. We present an empirical investigation of visual-auditory properties of language through a series of tasks, termed H-Test. These tasks highlight a fundamental gap between human linguistic understanding and the sensory-deprived linguistic understanding of LLMs. In support of our hypothesis, none of (1) deliberate reasoning (Chain-of-Thought), (2) few-shot examples, or (3) a stronger LLM from the same model family (LLaMA 2 13B -> LLaMA 2 70B) has a significant effect on H-Test performance. We bring in the philosophical case of Mary, who learns about the world in a sensory-deprived environment, as a useful conceptual framework for understanding how language-only models learn about the world (Jackson, 1986). Our experiments show that some of the strongest proprietary LLMs stay near the random-chance baseline accuracy of 50%, highlighting the limitations of linguistic knowledge acquired in the absence of sensory experience. Our code and data are available at <github.com/brucewlee/h-test>.
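
To make "near the random-chance baseline of 50%" concrete, one can ask whether an observed accuracy on a two-choice task is distinguishable from coin flipping. The following is a minimal sketch of an exact two-sided binomial test, assuming each question has two options and is scored independently. The function name and the example counts are hypothetical illustrations, not taken from the paper or its code.

```python
from math import comb


def binomial_two_sided_p(correct: int, total: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value for `correct` out of `total` vs. `chance`."""
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    # Probability of every possible number of correct answers under pure guessing.
    pmf = [
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(total + 1)
    ]
    observed = pmf[correct]
    # Sum the probability of all outcomes at least as unlikely as the observed one
    # (the small tolerance guards against floating-point ties).
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# Hypothetical counts for illustration only (not results from the paper).
print(binomial_two_sided_p(104, 200))  # 52% accuracy: large p-value, consistent with guessing
print(binomial_two_sided_p(150, 200))  # 75% accuracy: tiny p-value, clearly above chance
```

A large p-value means the score cannot be told apart from guessing at that sample size; it does not by itself show that the model is guessing.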

Paper Structure (16 sections, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Conundrum: What information is fundamentally absent in the current language training dynamics?
  • Figure 2: Making Progress on Language-only Modeling Does Not Trivially Solve H-Test: We test weaker models from the same families as the models in the proprietary-model table (tab:proprietary) under the same few-shot (k = 50) setup. The graphs for the Luminous model family are also shown in magnified versions to show that we are not depicting a flat line.
  • Figure 3: Current LLMs Do Not Solve H-Test Better with More Examples: We test four models from Figure 2 with different numbers of examples (k = {4, 14, 28, 50}). Although subtask accuracy does vary across few-shot setups, giving more or fewer examples does not significantly alter average H-Test performance.
  • Figure 4: H-Test is Not Meant to be Deliberately Reasoned: We test four instruction-following models with and without a CoT prompt at k = 14. In general, we observe that CoT decreases performance. Adj. accuracy is the score computed after excluding cases where the model did not generate a clearly interpretable CoT response.
  • Figure 5: H-Test vs. Letter Geometry: We compare the accuracy of seven models on H-Test and Letter Geometry. The red line represents the linear best fit.
  • ...and 2 more figures
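
The "Adj. accuracy" mentioned in the Figure 4 caption can be read as accuracy computed only over responses from which an answer could be extracted. The sketch below illustrates that reading. The data layout (`None` marking an uninterpretable response), the function names, and the sample answers are assumptions made for illustration, not the authors' evaluation code.

```python
from typing import Optional, Sequence


def accuracy(preds: Sequence[Optional[str]], golds: Sequence[str]) -> float:
    """Plain accuracy: an uninterpretable response (None) counts as wrong."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and equally long")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def adjusted_accuracy(
    preds: Sequence[Optional[str]], golds: Sequence[str]
) -> Optional[float]:
    """Accuracy over interpretable responses only; None if there are none."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must be equally long")
    # Drop every item whose response could not be parsed into an answer.
    pairs = [(p, g) for p, g in zip(preds, golds) if p is not None]
    if not pairs:
        return None
    return sum(p == g for p, g in pairs) / len(pairs)


# Hypothetical parsed answers; None marks a response with no clear answer.
preds = ["A", None, "B", "B", None, "A"]
golds = ["A", "B", "B", "A", "A", "A"]

print(accuracy(preds, golds))           # 0.5
print(adjusted_accuracy(preds, golds))  # 0.75
```

Reporting both numbers separates two failure modes: answering incorrectly, and failing to produce a usable answer at all.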