A Grounded Typology of Word Classes
Coleman Haley, Sharon Goldwater, Edoardo Ponti
TL;DR
This work introduces a grounded approach to typology that uses images as language-neutral proxies for meaning and quantifies semantic contentfulness with a PMI-based groundedness measure. It compares an image-conditioned captioning model with a text-only language model to estimate how much meaning word classes convey across 30 languages, revealing a robust lexical–functional gradient: nouns, adjectives, and verbs are generally more grounded than functional classes. The approach yields a dataset of groundedness scores and shows partial alignment with psycholinguistic concreteness norms, while challenging some assumptions about adpositions and about the semantic content of function words. Overall, the method provides a quantitative, cross-linguistic tool for studying semantic function in language, and it could extend to broader multimodal typology research as data and models improve.
Abstract
We propose a grounded approach to meaning in language typology. We treat data from perceptual modalities, such as images, as a language-agnostic representation of meaning. Hence, we can quantify the function–form relationship between images and captions across languages. Inspired by information theory, we define "groundedness", an empirical measure of contextual semantic contentfulness (formulated as a difference in surprisal) which can be computed with multilingual multimodal language models. As a proof of concept, we apply this measure to the typology of word classes. Our measure captures the contentfulness asymmetry between functional (grammatical) and lexical (content) classes across languages, but contradicts the view that functional classes do not convey content. Moreover, we find universal trends in the hierarchy of groundedness (e.g., nouns > adjectives > verbs), and show that our measure partly correlates with psycholinguistic concreteness norms in English. We release a dataset of groundedness scores for 30 languages. Our results suggest that the grounded typology approach can provide quantitative evidence about semantic function in language.
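To make the "difference in surprisal" formulation concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that per-token log-probabilities have already been obtained from two models, an image-conditioned captioning model and a text-only language model, and that groundedness is taken as the text-only surprisal minus the image-conditioned surprisal (a pointwise mutual information between token and image, given the preceding context). The function names, the sign convention, the use of natural logs, and the toy numbers and tags are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def groundedness(logp_with_image, logp_text_only):
    """Per-token groundedness, assuming it is defined as
    log p(token | image, context) - log p(token | context),
    i.e. text-only surprisal minus image-conditioned surprisal."""
    if len(logp_with_image) != len(logp_text_only):
        raise ValueError("both models must score the same token sequence")
    return [a - b for a, b in zip(logp_with_image, logp_text_only)]


def mean_by_class(scores, tags):
    """Average per-token scores within each word class."""
    buckets = defaultdict(list)
    for score, tag in zip(scores, tags):
        buckets[tag].append(score)
    return {tag: mean(values) for tag, values in buckets.items()}


# Toy caption "a dog runs" with made-up log-probabilities (hypothetical values).
tags = ["DET", "NOUN", "VERB"]
logp_img = [-0.5, -1.0, -2.0]   # image-conditioned captioning model
logp_txt = [-0.5, -4.0, -3.0]   # text-only language model

scores = groundedness(logp_img, logp_txt)
print(mean_by_class(scores, tags))
```

In this toy case the image makes "dog" much more predictable, so the noun receives the highest score, while the determiner, which the image does not help predict, scores zero. Under this assumed convention a negative score would mean the image made a token less predictable.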
