Composition-Incremental Learning for Compositional Generalization

Zhen Li, Yuwei Wu, Chenchen Jing, Che Sun, Chuanhao Li, Yunde Jia

TL;DR

This work defines Composition-Incremental Learning for Compositional Generalization (CompIL) to study how models can progressively learn new compositions in compositional zero-shot learning (CZSL) under a continual data stream. It introduces a benchmark construction pipeline that yields MIT-States-CompIL and C-GQA-CompIL, and a pseudo-replay framework that synthesizes visual representations of learned compositions with a visual synthesizer (VS) and preserves aligned primitive representations via linguistic primitive distillation. The approach leverages a pretrained vision-language model (CLIP) for cross-modal synthesis and distillation, and demonstrates improved recognition of unseen compositions and reduced forgetting on two CZSL models. The results support the claim that progressive exposure to diverse compositions can substantially strengthen compositional generalization in dynamic, long-tailed data regimes.

Abstract

Compositional generalization has achieved substantial progress in computer vision on pre-collected training data. Nonetheless, real-world data continually emerges, with possible compositions being nearly infinite, long-tailed, and not entirely visible. Thus, an ideal model is supposed to gradually improve the capability of compositional generalization in an incremental manner. In this paper, we explore Composition-Incremental Learning for Compositional Generalization (CompIL) in the context of the compositional zero-shot learning (CZSL) task, where models need to continually learn new compositions, intending to improve their compositional generalization capability progressively. To quantitatively evaluate CompIL, we develop a benchmark construction pipeline leveraging existing datasets, yielding MIT-States-CompIL and C-GQA-CompIL. Furthermore, we propose a pseudo-replay framework utilizing a visual synthesizer to synthesize visual representations of learned compositions and a linguistic primitive distillation mechanism to maintain aligned primitive representations across the learning process. Extensive experiments demonstrate the effectiveness of the proposed framework.

Paper Structure

This paper contains 29 sections, 11 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Accuracy of CZSL models on the unseen composition test set when trained with data containing varying numbers of compositions or samples. Increasing the number of training compositions boosts compositional generalization significantly more than increasing the number of training samples.
  • Figure 2: Illustration of our CompIL setting, taking the proposed MIT-States-CompIL benchmark as an example. The left side illustrates training samples from the tasks, where samples from different tasks vary significantly in semantics. For instance, the attribute "ancient" describes buildings in task $t-1$, outdated technology in task $t$, and natural landscapes in task $t+1$; the object "castle" varies likewise. The right side depicts the evaluation process, with rows representing training steps and each column indicating performance on the corresponding task.
  • Figure 3: Overview of the pseudo-replay framework in the context of compositional zero-shot learning. The CZSL model contains a visual encoder $\mathit{VE}_t$ and a language encoder $\mathit{LE}_t$. For simplicity, we omit possible connections between two components.
  • Figure 4: The architecture and training of the visual synthesizer. The visual synthesizer comprises an encoder $\mathit{VS}^E_t$ and a generator $\mathit{VS}^G_t$. It is trained to synthesize visual representations conditioned on the given composition name.
  • Figure 5: Compositional generalization capability of CSP, equipped with state-of-the-art methods, after training on different tasks in the compositional incremental learning process. The red line indicates CLIP's zero-shot performance.
  • ...and 1 more figure
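
The evaluation layout described in the Figure 2 caption (rows as training steps, columns as tasks) can be read as a lower-triangular accuracy matrix. The sketch below shows one way such a matrix might be summarized. It is an illustration under stated assumptions, not the paper's evaluation code: the function name `summarize_accuracy_matrix`, the two summary metrics (final average accuracy and average forgetting, as commonly used in continual learning), and the example numbers are all hypothetical and do not come from the paper.

```python
def summarize_accuracy_matrix(acc):
    """Summarize a continual-learning accuracy matrix.

    acc[i][j] is the accuracy on task j after training step i.
    Only entries with j <= i are read; the rest are ignored.
    The metric definitions are an assumption, not taken from the paper.
    """
    n = len(acc)
    if any(len(row) != n for row in acc):
        raise ValueError("expected a square matrix: one row and one column per task")

    # Final average accuracy: mean of the last row (after the final training step).
    final_row = acc[-1]
    avg_final = sum(final_row) / n

    # Forgetting for task j: best accuracy reached on task j at any earlier
    # step (from step j up to the second-to-last step) minus its final accuracy.
    forgetting = []
    for j in range(n - 1):
        best_earlier = max(acc[i][j] for i in range(j, n - 1))
        forgetting.append(best_earlier - final_row[j])

    # With a single task there is nothing to forget.
    avg_forgetting = sum(forgetting) / len(forgetting) if forgetting else 0.0
    return avg_final, avg_forgetting


# Hypothetical three-task example (numbers are made up for illustration).
example = [
    [0.6, 0.0, 0.0],  # after step 1: only task 1 has been learned
    [0.5, 0.7, 0.0],  # after step 2
    [0.4, 0.6, 0.8],  # after step 3
]
avg_final, avg_forgetting = summarize_accuracy_matrix(example)
print(f"final average accuracy: {avg_final:.3f}")
print(f"average forgetting:     {avg_forgetting:.3f}")
```

Reading the matrix this way keeps the two concerns separate: the last row reflects how well the model performs on every task at the end of the stream, while the column-wise drop from the best earlier value indicates how much was forgotten along the way.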