Acquiring Linguistic Knowledge from Multimodal Input

Theodor Amariucai, Alex Warstadt

TL;DR

This study investigates whether visual grounding can narrow the data-efficiency gap between language models and humans by adding multimodal input during pretraining. Using the FLAVA architecture with multitask training on the WiT dataset, the authors explore eight configurations that vary text and image volumes and evaluate them on the BabyLM benchmarks. Across BLiMP, GLUE/SuperGLUE, and MSGS, multimodal input yields no consistent improvement, though there are occasional marginal benefits at the smallest data scale and a potential regularizing effect. The work notes limitations, chiefly the small number of runs per configuration, and calls for better multimodal architectures and training strategies to realize the benefits of grounding in language learning.

Abstract

In contrast to children, language models (LMs) exhibit considerably inferior data efficiency when acquiring language. In this submission to the BabyLM Challenge (Warstadt et al., 2023), we test the hypothesis that this data efficiency gap is partly caused by a lack of multimodal input and grounding in the learning environment of typical language models. Although previous work looking into this question found that multimodal training can even harm language-only performance, we speculate that these findings can be attributed to catastrophic forgetting of complex language due to fine-tuning on captions data. To test our hypothesis, we perform an ablation study on FLAVA (Singh et al., 2022), a multimodal vision-and-language model, independently varying the volume of text and vision input to quantify how much text data (if any) can be offset by vision at different data scales. We aim to limit catastrophic forgetting through a multitask pretraining regime that includes unimodal text-only tasks and data sampled from WiT, the relatively diverse Wikipedia-based dataset (Srinivasan et al., 2021). Our results are largely negative: Multimodal pretraining does not harm our models' language performance but does not consistently help either. That said, our conclusions are limited by our having been able to conduct only a small number of runs. While we must leave open the possibility that multimodal input explains some of the gap in data efficiency between LMs and humans, positive evidence for this hypothesis will require better architectures and techniques for multimodal training.

Paper Structure (22 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: PPPL performance for the two data volumes of 10M and 100M words. The training steps on the x-axis are counted across all objectives.
  • Figure 3: Zero-shot accuracies, in percentages, obtained on the BLiMP task for each of the 12 grammatical categories and each FLAVA run configuration of input text volume (10M and 100M words) and input vision volume (0, 40K, 400K and 4M images). The model checkpoints used to generate these results were selected as described in Table \ref{tab:best_checkpoints}.
  • Figure 4: Validation losses for every training objective on a held-out set. While the MLM loss and, to a certain extent, the MMM (Text) loss are closely proportional to the pseudo-perplexity metric in Figure 1 (including some occasional spikes associated with checkpoint loading), the other losses are less stable. We point out some issues with the scheduler mechanism in Sections \ref{subsec:pppl} and \ref{sec:grounding}.
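
The eight run configurations mentioned above follow from crossing the two text volumes with the four image volumes listed in the Figure 3 caption. A minimal sketch of that grid is given below. The constant names, the `ablation_grid` helper, and the dictionary keys are hypothetical and chosen only for illustration; they are not taken from the authors' code.

```python
from itertools import product

# Volumes as listed in the Figure 3 caption.
# All names below are hypothetical, not the authors' identifiers.
TEXT_VOLUMES_WORDS = [10_000_000, 100_000_000]
IMAGE_VOLUMES = [0, 40_000, 400_000, 4_000_000]


def ablation_grid():
    """Return every (text volume, image volume) pair in the 2 x 4 grid."""
    return [
        {
            "text_words": words,
            "images": images,
            # A run with zero images is a text-only baseline.
            "multimodal": images > 0,
        }
        for words, images in product(TEXT_VOLUMES_WORDS, IMAGE_VOLUMES)
    ]


if __name__ == "__main__":
    for config in ablation_grid():
        print(config)
```

Under these assumptions, the two entries with zero images serve as the text-only baselines against which the six multimodal runs are compared at each text scale.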