Enhancing Compositional Generalization via Compositional Feature Alignment
Haoxiang Wang, Haozhe Si, Huajie Shao, Han Zhao
TL;DR
This work tackles compositional generalization (CG) in multi-domain, multi-class settings by introducing CG-Bench, a CG benchmark suite derived from real-world image datasets, and showing that standard pretraining-finetuning on vision foundation models struggles with CG. It proposes Compositional Feature Alignment (CFA), a simple two-stage finetuning method that trains two orthogonal linear heads (class and domain) on a frozen encoder and then finetunes the encoder with the heads frozen, under a normalization constraint that promotes a compositional feature structure. The authors provide a theoretical guarantee under a neural-collapse–inspired framework, demonstrating that CFA drives features toward a decomposition z_i^* = W_1^T a_{y_i} + W_2^T b_{e_i} in orthogonal subspaces, which supports better generalization to unseen domain-class pairs. Empirically, CFA improves CG performance on CG-Bench for both CLIP and DINOv2, often surpassing standard fine-tuning and LP-FT baselines, with WiSE-FT postprocessing offering additional gains; CFA maintains ID performance while reducing OOD degradation, highlighting its practical potential for CG under distribution shifts.
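To make the decomposition z_i^* = W_1^T a_{y_i} + W_2^T b_{e_i} concrete, here is a minimal sketch in plain Python. It is a toy illustration, not the authors' implementation: the 4-dimensional feature space, the specific matrices W1 and W2, the coefficient vectors A and B, and the helper names (`feature`, `predict`) are all assumptions chosen for readability. It shows only why features built this way allow a class head and a domain head to read off their labels independently, including for a domain-class pair that is treated as unseen.

```python
# Toy illustration under stated assumptions (not the paper's code):
# features decompose into a class part and a domain part that live
# in orthogonal subspaces of the feature space.

def matvec_T(W, v):
    """Return W^T v, where W is a list of k rows of length d and v has length k."""
    d = len(W[0])
    return [sum(W[r][c] * v[r] for r in range(len(W))) for c in range(d)]

def matvec(W, z):
    """Return W z, where W is a list of k rows of length d and z has length d."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

# Hypothetical heads: W1 (class) spans dims 0-1, W2 (domain) spans dims 2-3,
# so every row of W1 is orthogonal to every row of W2.
W1 = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0]]
W2 = [[0.0, 0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]

# Hypothetical coefficient vectors a_y (per class) and b_e (per domain).
A = [[1.0, -1.0], [-1.0, 1.0]]
B = [[1.0, -1.0], [-1.0, 1.0]]

def feature(y, e):
    """Build z = W1^T a_y + W2^T b_e for class y and domain e."""
    class_part = matvec_T(W1, A[y])
    domain_part = matvec_T(W2, B[e])
    return [c + d for c, d in zip(class_part, domain_part)]

def predict(z):
    """Return (predicted class, predicted domain) from the two linear heads."""
    return argmax(matvec(W1, z)), argmax(matvec(W2, z))

# Pairs treated as seen during training, and one held-out combination.
seen_pairs = [(0, 0), (0, 1), (1, 0)]
unseen_pair = (1, 1)

# Cross products between the heads: all zero when the subspaces are orthogonal.
cross = [sum(a * b for a, b in zip(r1, r2)) for r1 in W1 for r2 in W2]

if __name__ == "__main__":
    print("head cross products:", cross)
    for pair in seen_pairs + [unseen_pair]:
        print(pair, "->", predict(feature(*pair)))
```

Because the class head W1 ignores the domain subspace (and vice versa), the class prediction for the held-out pair depends only on a_y. In this idealized setting, that is the intuition behind the claim that a compositional feature structure supports generalization to unseen domain-class pairs.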
Abstract
Real-world applications of machine learning models often confront data distribution shifts, wherein discrepancies exist between the training and test data distributions. In the common multi-domain multi-class setup, as the number of classes and domains scales up, it becomes infeasible to gather training data for every domain-class combination. This challenge naturally leads to the quest for models with Compositional Generalization (CG) ability, where models can generalize to unseen domain-class combinations. To delve into the CG challenge, we develop CG-Bench, a suite of CG benchmarks derived from existing real-world image datasets, and observe that the prevalent pretraining-finetuning paradigm on foundation models, such as CLIP and DINOv2, struggles with it. To address this, we propose Compositional Feature Alignment (CFA), a simple two-stage finetuning technique that i) learns two orthogonal linear heads on a pretrained encoder with respect to class and domain labels, and ii) fine-tunes the encoder with the newly learned heads frozen. We theoretically and empirically justify that CFA encourages compositional feature learning of pretrained models. We further conduct extensive experiments on CG-Bench for CLIP and DINOv2, two powerful pretrained vision foundation models. Experimental results show that CFA outperforms common finetuning techniques in compositional generalization, corroborating CFA's efficacy in compositional feature learning.
