Composition-Incremental Learning for Compositional Generalization
Zhen Li, Yuwei Wu, Chenchen Jing, Che Sun, Chuanhao Li, Yunde Jia
TL;DR
This work defines Composition-Incremental Learning for Compositional Generalization (CompIL) to study progressive learning of new compositions in CZSL under a continual data stream. It introduces a benchmark construction pipeline that yields MIT-States-CompIL and C-GQA-CompIL, and a pseudo-replay framework that synthesizes visual composition representations with a VS and preserves aligned primitive representations via linguistic primitive distillation. The approach leverages a pretrained vision-language model (CLIP) for cross-modal synthesis and distillation, and demonstrates improvements in unseen composition recognition and reduced forgetting on two CZSL models. The results validate that progressive exposure to diverse compositions can substantially bolster compositional generalization in dynamic, long-tailed data regimes.
Abstract
Compositional generalization has achieved substantial progress in computer vision on pre-collected training data. Nonetheless, real-world data continually emerges, with possible compositions being nearly infinite, long-tailed, and not entirely visible. Thus, an ideal model is supposed to gradually improve the capability of compositional generalization in an incremental manner. In this paper, we explore Composition-Incremental Learning for Compositional Generalization (CompIL) in the context of the compositional zero-shot learning (CZSL) task, where models need to continually learn new compositions, intending to improve their compositional generalization capability progressively. To quantitatively evaluate CompIL, we develop a benchmark construction pipeline leveraging existing datasets, yielding MIT-States-CompIL and C-GQA-CompIL. Furthermore, we propose a pseudo-replay framework utilizing a visual synthesizer to synthesize visual representations of learned compositions and a linguistic primitive distillation mechanism to maintain aligned primitive representations across the learning process. Extensive experiments demonstrate the effectiveness of the proposed framework.
