Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data

Badr AlKhamissi, Yingtian Tang, Abdülkadir Gökce, Johannes Mehrer, Martin Schrimpf

TL;DR

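A self-synthesis approach that trains a vision-language model in four phases: learning fundamental language abilities from a small text corpus, connecting language to vision through image captioning, further training the language component on self-generated captions of unlabeled images, and developing advanced cognitive skills through tasks such as visual question answering and reasoning.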

Abstract

While today's large language models exhibit impressive abilities in generating human-like text, they require massive amounts of data during training. Here, we take inspiration from human cognitive development to train models in limited-data conditions. Specifically, we present a self-synthesis approach that iterates through four phases. Phase 1 sets up fundamental language abilities by training the model from scratch on a small corpus. In phase 2, language is associated with the visual environment: the model is integrated with a vision encoder and trained to generate descriptive captions from labeled images. In the "self-synthesis" phase 3, the model generates captions for unlabeled images, which it then uses to further train its language component on a mix of synthetic and previous real-world text. This phase is meant to expand the model's linguistic repertoire, similar to humans self-annotating new experiences. Finally, phase 4 develops advanced cognitive skills by training the model on specific tasks such as visual question answering and reasoning. Our approach offers a proof of concept for training a multimodal model using a developmentally plausible amount of data.
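The data flow of phase 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy captioner, and the `real_per_synthetic` mixing ratio are hypothetical, since the abstract only states that synthetic captions are mixed with previously seen real-world text.

```python
import random
from typing import Callable, List, Sequence


def build_phase3_corpus(
    caption_fn: Callable[[str], str],
    unlabeled_images: Sequence[str],
    real_text: Sequence[str],
    real_per_synthetic: float = 1.0,  # assumption: the mixing ratio is not given in the source
    seed: int = 0,
) -> List[str]:
    """Caption unlabeled images, then mix the captions with previously seen real text."""
    rng = random.Random(seed)
    # Self-synthesis: the model describes images that come without labels.
    synthetic = [caption_fn(image) for image in unlabeled_images]
    # Replay a sample of earlier real-world text alongside the synthetic captions.
    n_real = min(len(real_text), round(real_per_synthetic * len(synthetic)))
    replay = rng.sample(list(real_text), n_real)
    corpus = synthetic + replay
    rng.shuffle(corpus)
    return corpus


def toy_captioner(image_id: str) -> str:
    """Hypothetical stand-in for the phase 2 captioning model."""
    return f"a photo of {image_id}"


corpus = build_phase3_corpus(
    toy_captioner,
    unlabeled_images=["img_001", "img_002"],
    real_text=["the dog barked .", "she reads a book .", "it is raining ."],
)
```

In a real setup, `caption_fn` would wrap the vision-language model from phase 2, and the returned corpus would be used to continue training the language component only.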

Paper Structure

This paper contains 28 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Self-Synthesis Training Framework. Our model BabyLLaMA is trained in four phases that connect fundamental language abilities to vision by learning to describe unlabeled visual experiences. Each phase passes its best snapshot, measured by validation loss, to the next phase. Phase 1 covers fundamental language skill acquisition using 50M words. Phase 2 combines visual and text data (35M words) to learn to describe objects and scenes. In phase 3, the self-synthesis step that gives the approach its name, we generate captions from images and use this synthesized text (i.e., 0 words from real-world corpora) to further train the model's language decoder. Phase 4 closes the loop using 15M words to develop skills for advanced visuo-linguistic tasks such as question answering and reasoning about the environment.
  • Figure 2: Overview diagram illustrating the four phases of training. Starting from training on text only (phase 1), language capabilities are connected to images (phase 2). The model then self-synthesizes text (red border) on unseen images and uses this text to continue training the language component (phase 3), which is further refined for tasks such as question answering (phase 4). Sizes of model components do not reflect the number of parameters.
  • Figure 3: Average performance on all language-only (left) and vision-language (right) benchmarks across training phases. Each phase yields a small boost for its respective training objective.
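
The per-phase word counts and the snapshot hand-off described in the Figure 1 caption can be checked with a short sketch. The word counts come from the caption; the helper names and the example validation losses are hypothetical and only illustrate the "lowest validation loss" selection rule.

```python
from typing import Dict, List, Tuple

# Real-world words per phase, in millions, as listed in the Figure 1 caption.
REAL_WORDS_M: Dict[str, int] = {
    "phase1_text_only": 50,
    "phase2_captioning": 35,
    "phase3_self_synthesis": 0,
    "phase4_tasks": 15,
}


def total_real_words_m(budget: Dict[str, int]) -> int:
    """Sum the real-world word budget across all phases (in millions)."""
    return sum(budget.values())


def best_snapshot(snapshots: List[Tuple[str, float]]) -> str:
    """Return the name of the snapshot with the lowest validation loss."""
    return min(snapshots, key=lambda item: item[1])[0]


# Hypothetical snapshot names and validation losses, for illustration only.
phase1_snapshots = [("step_1000", 2.1), ("step_2000", 1.7), ("step_3000", 1.9)]
handoff = best_snapshot(phase1_snapshots)
total = total_real_words_m(REAL_WORDS_M)
```

With the caption's numbers, the four phases together use 50 + 35 + 0 + 15 = 100M words of real-world text.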