Iterated Learning Improves Compositionality in Large Vision-Language Models

Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna

TL;DR

The paper tackles the lack of compositionality in large vision–language models by introducing an iterated learning framework inspired by cultural transmission and framed as a Lewis Signaling Game between vision and language agents. A shared codebook constrains representations, and the language agent is periodically reset across generations, with a distillation step to stabilize the codebook, promoting easier-to-learn, more compositional representations. Empirical results on CC3M/CC12M show substantial gains on compositional benchmarks such as SugarCrepe and CREPE while preserving recognition performance, supported by analyses indicating a more interpretable codebook and smoother representations across generations. This approach demonstrates a viable pathway to imbue large multimodal models with compositional structure and suggests broader applicability to other domains requiring systematic generalization.

Abstract

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, recent investigations find that most, if not all, of our state-of-the-art vision-language models struggle at compositionality. They are unable to distinguish between images of "a girl in white facing a man in black" and "a girl in black facing a man in white". Moreover, prior work suggests that compositionality doesn't arise with scale: larger model sizes or training data don't help. This paper develops a new iterated training algorithm that incentivizes compositionality. We draw on decades of cognitive science research that identifies cultural transmission (the need to teach a new generation) as a necessary inductive prior that incentivizes humans to develop compositional languages. Specifically, we reframe vision-language contrastive learning as the Lewis Signaling Game between a vision agent and a language agent, and operationalize cultural transmission by iteratively resetting one of the agents' weights during training. After every iteration, this training paradigm induces representations that become "easier to learn", a property of compositional languages: e.g., our model trained on CC3M and CC12M improves on standard CLIP by 4.7% and 4.0%, respectively, on the SugarCrepe benchmark.
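
To make the "vision-language contrastive learning" objective mentioned in the abstract concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over one batch of paired embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function names, the plain-list embeddings, and the temperature of 0.07 are all hypothetical, and the shared codebook is not modeled here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over in-batch image-text similarities.

    Pair k (image k, text k) is the positive; every other pairing in the
    batch acts as a negative. The temperature value is an assumption.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: correctly paired embeddings vs. the same texts in swapped order.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

In the signaling-game reading, a low loss corresponds to the two agents agreeing on which caption goes with which image; the swapped batch yields a much higher loss than the aligned one.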

Paper Structure (20 sections, 3 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: (1) From studying the language that emerges from the Lewis Signaling Game, evolutionary linguistics found that iterated learning with cultural transmission leads to language compositionality. (2) We interpret vision-language model training as a Lewis Signaling Game between neural agents, and discover that iterated learning can also improve the compositionality of vision-language models' representations.
  • Figure 2: Our iterated learning algorithm is built on CLIP augmented with a shared codebook. The algorithm consists of a warmup stage and three iterated phases that cycle until the end of training. In each cycle, we 1) spawn a new language agent to replace the old one, 2) freeze the codebook weights for a certain number of steps, and 3) let the agents interact under standard vision-language contrastive learning.
  • Figure 3: Iterated learning loss curve. Cross-modality alignment steadily improves across generations.
  • Figure 4: Estimated upper bound of the Lipschitz constant for Codebook-CLIP and different generations of IL-CLIP (log scale).
  • Figure 5: In-batch image-text accuracy vs. training step when a new language encoder is trained to align with a fixed visual representation. We compare visual representations produced with iterated learning (left) and without iterated learning (right).
  • ...and 4 more figures
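
The training cycle described in the Figure 2 caption (a warmup stage, then repeated generations in which a new language agent is spawned, the codebook is held fixed for some steps, and the agents then interact normally) can be sketched as a per-step schedule. This is only an illustration: the function name and the parameters `warmup_steps`, `freeze_steps`, and `interact_steps` are hypothetical, the caption does not give phase lengths, and what is optimized while the codebook is frozen is not specified in this section.

```python
def iterated_learning_schedule(total_steps, warmup_steps, freeze_steps, interact_steps):
    """Yield (step, generation, reset_language_agent, codebook_frozen) per step.

    Generation 0 is the warmup stage. Each later generation begins by
    spawning a fresh language agent, keeps the codebook frozen for
    `freeze_steps` steps, then trains normally for `interact_steps` steps.
    All phase lengths are hypothetical parameters.
    """
    cycle = freeze_steps + interact_steps
    if cycle <= 0:
        raise ValueError("freeze_steps + interact_steps must be positive")
    for step in range(total_steps):
        if step < warmup_steps:
            # Warmup: no reset, codebook trainable.
            yield step, 0, False, False
            continue
        offset = step - warmup_steps
        generation = offset // cycle + 1
        position = offset % cycle
        reset_language_agent = position == 0      # first step of a generation
        codebook_frozen = position < freeze_steps  # early part of a generation
        yield step, generation, reset_language_agent, codebook_frozen


# Toy run: 2 warmup steps, then generations of 1 frozen + 3 interaction steps.
plan = list(iterated_learning_schedule(10, warmup_steps=2, freeze_steps=1, interact_steps=3))
num_resets = sum(1 for _, _, reset, _ in plan if reset)
```

A training loop could consume this schedule by re-initializing the language encoder whenever `reset_language_agent` is true and excluding the codebook from the optimizer update while `codebook_frozen` is true.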