Does Vision Accelerate Hierarchical Generalization in Neural Language Learners?

Tatsuki Kuribayashi, Timothy Baldwin

TL;DR

This work investigates whether visual grounding accelerates hierarchical generalization in neural language learners by adapting a poverty-of-stimulus paradigm to vision-and-language inputs. It compares natural image-caption data with artificial data to test whether vision provides disambiguating cues that bias learners toward hierarchical structures, particularly in subject-verb number agreement. The findings show that vision boosts hierarchical generalization in the artificial setting, especially early in training, but offers limited or no advantage in the natural setting, suggesting that additional biases or attentional signals (e.g., mutual gaze) may be necessary for effective multimodal grounding. These results highlight the nuanced role of grounding in data-efficient language learning and point to future work on designing biases and signals that enable robust cross-modal generalization in neural systems.

Abstract

Neural language models (LMs) are arguably less data-efficient than humans from a language acquisition perspective. One fundamental question is why this human-LM gap arises. This study explores the advantage of grounded language acquisition, specifically the impact of visual information -- which humans can usually rely on but LMs largely do not have access to during language acquisition -- on syntactic generalization in LMs. Our experiments, following the poverty of stimulus paradigm under two scenarios (using artificial vs. naturalistic images), demonstrate that if the alignments between the linguistic and visual components are clear in the input, access to vision data does help with the syntactic generalization of LMs, but if not, visual input does not help. This highlights the need for additional biases or signals, such as mutual gaze, to enhance cross-modal alignment and enable efficient syntactic generalization in multimodal LMs.
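The poverty-of-stimulus setup for subject-verb number agreement can be illustrated with a toy sketch. This is not the authors' code or data: the `AgreementExample` schema, the rule functions, and the example sentences are all hypothetical and chosen only for illustration. The idea is that training examples are consistent with both a linear rule (the verb agrees with the nearest noun) and a hierarchical rule (the verb agrees with the head of the subject), while test examples are those where the two rules disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgreementExample:
    """A toy sentence reduced to the two cues that matter here (hypothetical schema)."""
    text: str
    subject_number: str  # number of the subject's head noun: "sg" or "pl"
    nearest_number: str  # number of the noun linearly closest to the verb


def hierarchical_rule(ex: AgreementExample) -> str:
    """Predict the verb's number from the head noun of the subject."""
    return ex.subject_number


def linear_rule(ex: AgreementExample) -> str:
    """Predict the verb's number from the noun nearest to the verb."""
    return ex.nearest_number


def is_ambiguous(ex: AgreementExample) -> bool:
    """An example is ambiguous when both rules predict the same verb number."""
    return hierarchical_rule(ex) == linear_rule(ex)


# Illustrative training data: both rules agree, so the data alone
# cannot tell a learner which rule is the intended one.
train_examples = [
    AgreementExample("a cat with a hat walks", "sg", "sg"),
    AgreementExample("cats with hats walk", "pl", "pl"),
]

# Illustrative test data: the rules disagree, which reveals
# the generalization a learner actually acquired.
test_examples = [
    AgreementExample("a cat with glasses walks", "sg", "pl"),
    AgreementExample("cats with a hat walk", "pl", "sg"),
]

for ex in test_examples:
    print(ex.text, "| hierarchical:", hierarchical_rule(ex), "| linear:", linear_rule(ex))
```

In this sketch, a learner that outputs the hierarchical prediction on the test examples would count as having generalized hierarchically; how the paper scores this preference is described in its own evaluation sections.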

Paper Structure

This paper contains 33 sections, 1 equation, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Overview of the experimental design. A vision-language neural model is trained on ambiguous data for a particular linguistic rule. Then, we test whether the model learned a cognitively plausible rule using data disambiguating the model's generalization. Through this experimental scheme, we adjust whether/how the visual information helps the model infer the proper linguistic generalization.
  • Figure 2: Images can explicate the subject--verb dependency. If a learner can ground cat, glasses, and walk to their visual components, they can determine that it is the cat, not the glasses, that is walking; such information will potentially bias the learner's language acquisition in favor of the linguistically correct rule.
  • Figure 3: Generalization performance of the model initialized with ViT-base. The $x$-axis denotes the parameter update steps, and the $y$-axis denotes the preference for the Hierarchical generalization rule (F1 scores multiplied by 100). We adopted four settings with different injection rates of {0, 0.001, 0.005, 0.01}. The solid lines correspond to the models with visual input, and the dashed lines correspond to those without visual input. The chance rate of the F1 score is 50.
  • Figure 4: Relationship between encoders' ImageNet accuracy (x-axis) and their advantage in Hierarchical generalization (F1 score difference between the models with and without visual input; y-axis). The F1 score is measured at several checkpoints during training (1000, 5000, and 10000 update steps).
  • Figure 5: Relationship between encoders' captioning performance in the validation set (x-axis) and their advantage in Hierarchical generalization (F1 score difference between the models with and without visual input; y-axis). These scores are measured at several checkpoints during training (1000, 5000, and 10000 update steps).