Enhancing Compositional Generalization via Compositional Feature Alignment

Haoxiang Wang, Haozhe Si, Huajie Shao, Han Zhao

TL;DR

This work tackles compositional generalization (CG) in multi-domain, multi-class settings by introducing CG-Bench, a CG benchmark suite derived from real-world image datasets, and showing that standard pretraining-finetuning on vision foundation models struggles with CG. It proposes Compositional Feature Alignment (CFA), a simple two-stage finetuning method that trains two orthogonal linear heads (class and domain) on a frozen encoder and then finetunes the encoder with the heads frozen, under a normalization constraint that promotes a compositional feature structure. The authors provide a theoretical guarantee under a neural-collapse–inspired framework, demonstrating that CFA drives features toward a decomposition z_i^* = W_1^T a_{y_i} + W_2^T b_{e_i} in orthogonal subspaces, which supports better generalization to unseen domain-class pairs. Empirically, CFA improves CG performance on CG-Bench for both CLIP and DINOv2, often surpassing standard fine-tuning and LP-FT baselines, with WiSE-FT postprocessing offering additional gains; CFA maintains ID performance while reducing OOD degradation, highlighting its practical potential for CG under distribution shifts.

Abstract

Real-world applications of machine learning models often confront data distribution shifts, wherein discrepancies exist between the training and test data distributions. In the common multi-domain multi-class setup, as the number of classes and domains scales up, it becomes infeasible to gather training data for every domain-class combination. This challenge naturally leads to the quest for models with Compositional Generalization (CG) ability, where models can generalize to unseen domain-class combinations. To delve into the CG challenge, we develop CG-Bench, a suite of CG benchmarks derived from existing real-world image datasets, and observe that the prevalent pretraining-finetuning paradigm on foundation models, such as CLIP and DINOv2, struggles with the challenge. To address this challenge, we propose Compositional Feature Alignment (CFA), a simple two-stage finetuning technique that i) learns two orthogonal linear heads on a pretrained encoder with respect to class and domain labels, and ii) fine-tunes the encoder with the newly learned heads frozen. We theoretically and empirically justify that CFA encourages compositional feature learning of pretrained models. We further conduct extensive experiments on CG-Bench for CLIP and DINOv2, two powerful pretrained vision foundation models. Experimental results show that CFA outperforms common finetuning techniques in compositional generalization, corroborating CFA's efficacy in compositional feature learning.
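
As a rough illustration of stage i), the sketch below fits a class head and a domain head on frozen features and forces the two heads into orthogonal subspaces. Everything in it is an assumption made for illustration. The abstract does not say how orthogonality is enforced, so the sketch swaps in closed-form ridge regression on one-hot labels followed by a projection step, which is not necessarily the objective the paper uses. The function names `ridge_head` and `fit_orthogonal_heads`, the regularizer `lam`, and the synthetic features are all hypothetical. Stage ii), finetuning the encoder against the frozen heads, is not shown.

```python
import numpy as np

def ridge_head(Z, targets, lam=1e-2):
    """Closed-form ridge regression head; returns weights of shape (num_labels, d)."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ targets).T

def fit_orthogonal_heads(Z, y, e, K, E, lam=1e-2):
    """Fit a class head W1 (K x d) and a domain head W2 (E x d) on frozen features Z (n x d).

    Assumption: orthogonality is enforced by fitting W2 on features projected
    onto the orthogonal complement of W1's row space. This is a stand-in,
    not the paper's training procedure.
    """
    d = Z.shape[1]
    W1 = ridge_head(Z, np.eye(K)[y], lam)
    # Orthonormal basis Q (d x K) whose span contains the row space of W1.
    Q, _ = np.linalg.qr(W1.T)
    P = np.eye(d) - Q @ Q.T  # projector onto the orthogonal complement
    W2 = ridge_head(Z @ P, np.eye(E)[e], lam)
    # Normalize every row of both heads to unit length.
    W1 = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
    W2 = W2 / np.linalg.norm(W2, axis=1, keepdims=True)
    return W1, W2

# Synthetic "frozen encoder" features: class direction + domain direction + noise.
rng = np.random.default_rng(0)
K, E, d, n = 3, 4, 16, 240
class_dirs = rng.standard_normal((K, d))
domain_dirs = rng.standard_normal((E, d))
y = rng.integers(0, K, size=n)
e = rng.integers(0, E, size=n)
Z = class_dirs[y] + domain_dirs[e] + 0.1 * rng.standard_normal((n, d))

W1, W2 = fit_orthogonal_heads(Z, y, e, K, E)
max_cross = np.abs(W1 @ W2.T).max()  # close to zero when the heads are orthogonal
print("largest |<class row, domain row>|:", max_cross)
```

Because the domain head is fit after the class head, this sequential projection favors the class head; a joint objective with an orthogonality penalty would treat the two heads symmetrically.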

Paper Structure (34 sections, 4 theorems, 18 equations, 7 figures, 7 tables)

This paper contains 34 sections, 4 theorems, 18 equations, 7 figures, 7 tables.

Key Result

Theorem 1

Assuming the feature dimension $d$ is no smaller than $K+E$, training data exists for each class and domain (though not necessarily for each domain-class combination), and $W_1$ and $W_2$ are normalized and span orthogonal subspaces such that $W_1 \in \mathcal{U}(d)^K, W_2 \in \mathcal{U}(d)^{E}$ [...] where $\boldsymbol{a}_{y_i}\in \mathbb{R}^{K}$ is a vector depending on the class label $y_i$, and [...]
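
To make the compositional structure concrete, the following minimal sketch builds features of the form z = W_1^T a_y + W_2^T b_e (the decomposition quoted in the TL;DR) and checks numerically that the class logits W_1 z do not depend on the domain. Under that structure, a domain-class pair that never appeared in training is still classified correctly. The setup rests on assumptions that go beyond the source: the heads are given orthonormal rows via a QR factorization, which is stronger than the "normalized, orthogonal subspaces" condition in the theorem; the vectors a_y and b_e are set to scaled one-hot vectors purely for illustration; and the sizes K, E, d are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K, E = 3, 4
d = K + E  # the theorem assumes d >= K + E

# Orthonormal basis of R^d: the first K vectors form the class head,
# the next E vectors form the domain head, so W1 @ W2.T == 0.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
W1 = Q[:, :K].T        # shape (K, d)
W2 = Q[:, K:K + E].T   # shape (E, d)

# Hypothetical class- and domain-dependent coefficient vectors.
A = 2.0 * np.eye(K)    # row y plays the role of a_y
B = 3.0 * np.eye(E)    # row e plays the role of b_e

def feature(y, e):
    """Compositional feature z = W1^T a_y + W2^T b_e."""
    return W1.T @ A[y] + W2.T @ B[e]

def predict_class(z):
    return int(np.argmax(W1 @ z))

def predict_domain(z):
    return int(np.argmax(W2 @ z))

pairs = [(y, e) for y in range(K) for e in range(E)]
class_ok = all(predict_class(feature(y, e)) == y for y, e in pairs)
domain_ok = all(predict_domain(feature(y, e)) == e for y, e in pairs)

# Class logits for the same class under two different domains should coincide.
logit_gap = max(
    np.abs(W1 @ feature(y, 0) - W1 @ feature(y, e)).max() for y, e in pairs
)
print(class_ok, domain_ok, logit_gap)
```

The check works because W_1 z = W_1 W_1^T a_y + W_1 W_2^T b_e, and the second term vanishes when the two heads span orthogonal subspaces.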

Figures (7)

  • Figure 1: Compositional generalization (CG) vs. domain generalization (DG). Masked entries are unseen domain-class combinations, while unmasked ones exist in the training dataset.
  • Figure 2: Illustration of a desired compositional feature structure for compositional generalization.
  • Figure 3: Illustration of our proposed method, Compositional Feature Alignment (CFA).
  • Figure 4: Test accuracy statistics for each domain-class combination on the DomainNet dataset. Left: Test accuracy versus the number of training examples. Points show the median accuracy, and the shaded area is bounded by the 25% and 75% quantiles. Right: The number of domain-class combinations at different zero-shot test accuracies.
  • Figure 5: Visualization of CLIP ViT-B/16 features before and after CFA finetuning. Left: Features extracted using the pre-trained CLIP ViT-B/16 image encoder. Right: Features extracted using the CFA-finetuned CLIP ViT-B/16 image encoder.
  • ...and 2 more figures
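
The distinction drawn in Figure 1 can be sketched as two boolean masks over a domain-by-class grid, where `True` marks a combination present in training. This is only an illustrative construction under assumptions, not how CG-Bench is built: the helper names `dg_mask` and `cg_mask`, the grid size, and the greedy random masking are all hypothetical. The CG mask keeps at least one seen entry in every row and column, mirroring the requirement that training data exists for each class and each domain.

```python
import numpy as np

def dg_mask(E, K, held_out_domains):
    """Domain generalization: entire domains (rows) are unseen during training."""
    seen = np.ones((E, K), dtype=bool)
    seen[list(held_out_domains), :] = False
    return seen

def cg_mask(E, K, num_unseen, seed=0):
    """Compositional generalization: scattered (domain, class) pairs are unseen,
    but every domain and every class keeps at least one seen pair.

    May return fewer than num_unseen hidden pairs if the request is too large.
    """
    rng = np.random.default_rng(seed)
    seen = np.ones((E, K), dtype=bool)
    for idx in rng.permutation(E * K):
        if (~seen).sum() == num_unseen:
            break
        dom, cls = divmod(int(idx), K)
        # Hide the pair only if its row and its column each keep another seen entry.
        if seen[dom].sum() > 1 and seen[:, cls].sum() > 1:
            seen[dom, cls] = False
    return seen

dg = dg_mask(4, 5, held_out_domains=[3])
cg = cg_mask(4, 5, num_unseen=8)
print(dg.astype(int))
print(cg.astype(int))
```

In the DG mask a whole row is zero, so the held-out domain is never observed; in the CG mask every domain and class is observed somewhere, and only particular pairings are missing.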

Theorems & Definitions (10)

  • Definition 1: Compositional Feature Structure
  • Theorem 1: Feature Alignment
  • Definition 2: Simplex-Encoding Label Matrices
  • Definition 3: SVD of Heads
  • Lemma 1: Optimum of Class Loss
  • proof
  • Lemma 2: Optimum of Domain Loss
  • proof
  • Theorem 2: Feature Alignment (Theorem 1 Restated)
  • proof