Scalable Evaluation and Neural Models for Compositional Generalization

Giacomo Camposampiero, Pietro Barbiero, Michael Hersche, Roger Wattenhofer, Abbas Rahimi

TL;DR

This work tackles compositional generalization (CG) in vision by proposing a universal, scalable evaluation framework (orthotopic evaluation) that reduces the cost of evaluation from combinatorial to constant and introduces a compositional similarity index $c$ that defines a ladder of evaluation difficulty. It validates the framework with a large-scale benchmark of more than 5000 trained vision models across six datasets, and introduces Attribute Invariant Networks (AINs), which enforce attribute-invariant gradient updates and establish a new Pareto frontier between scalability and generalization with modest parameter overhead. The study also compares AINs against explicitly disentangled (ED) models and shows that, while ED models can reach high CG performance, AINs deliver competitive gains with far lower overhead, offering a practical path toward scalable compositional learning in supervised vision. The findings underscore the importance of accounting for CG difficulty via $c$, demonstrate the limits of standard architectures, and propose a concrete architectural pathway to improve compositional generalization in real-world settings. Code and resources are provided to support reproducibility and wider adoption of the proposed benchmarks and AINs.

Abstract

Compositional generalization, a key open challenge in modern machine learning, requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them with such biases compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation of the status of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts. Our code is available at https://github.com/IBM/scalable-compositional-generalization.

Paper Structure

This paper contains 74 sections, 2 theorems, 5 equations, 25 figures, 34 tables, 4 algorithms.

Key Result

Theorem 4.2

Let $(\mathbf{x}, \mathbf{y})$ be a sample, and let $f_j(\mathbf{x})$ be an AIN's logit corresponding to attribute $j$. Then, for every group action $\mathfrak{g}\in\mathfrak{G}_j$, if $j \neq i$, then $\nabla_{h_i} \mathcal{L}(y_j, f_j(\mathbf{x})) = \nabla_{h_i} \mathcal{L}(y_j, f_j(\mathfrak{g}.\

Figures (25)

  • Figure 1: Theoretical setup.
  • Figure 2: Orthotopic evaluation for compositional generalization. Intuitively, orthotopic OOD split generation works by iteratively projecting and pruning the data in every $c$-dimensional attribute subspace. (a) illustrates this on the Shapes3D dataset, where we consider only $I=3$ attributes for simplicity and highlight the disentangled compositional split ($c=1$, green+yellow) and the entangled compositional split ($c=2$, yellow). (b) depicts the proposed ladder of compositional evaluation difficulty: the $c$ parameter controls the similarity between train and test generative factors, so that difficulty ranges from extrapolation to in-distribution regimes. (c) shows the computational advantage of the proposed evaluation technique over the naïve pair-wise evaluation strategy.
  • Figure 3: Orthotopic evaluation results. (a) reports the test accuracy on the compositional generalization split for different values of the compositional similarity index $c$. The results are collected on six well-known representation learning datasets and grouped into the six major families of models considered in this work. The uncertainty (SEM) is computed over the models in each family (various model sizes, pre-training, etc.) and over 3 random seeds. (b) compares the results of orthotopic evaluation (shown at the granularity of single models, averaged across datasets and seeds, for $c=1$) with those of pair-wise evaluation, a more precise but less efficient technique. (c) groups the same results by model family to give a more general picture. All results in this figure come from a large-scale analysis encompassing more than 5000 training runs.
  • Figure 4: (a) The function $f_{\text{shape}}$ is attribute invariant as the prediction is only affected by shape transformations $\mathfrak{g} \in \mathfrak{G}_{\text{shape}}$, while it remains unaffected under any other attribute (e.g., color) transformation $\mathfrak{g} \in \mathfrak{G}_{j}$ such that $j \neq \text{"shape"}$. (b) In Attribute Invariant Networks, only the shape encoder can receive a nonzero gradient for the shape attribute; all other attribute encoders receive a zero gradient. (c) Attribute Invariant Networks' size is nearly constant for different numbers of attributes, while ED models do not scale. Model size provided as mean and $95\%$ confidence interval over different residual networks (i.e., ResNet{18,34,50,101,152}).
  • Figure 5: Pareto optimality.
  • ...and 20 more figures
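
The gradient property described in the caption of Figure 4(b) can be illustrated with a toy model. The sketch below is not the paper's AIN architecture: it assumes one fully separate linear encoder and scalar head per attribute, with hypothetical names (`encode`, `loss_for_attribute`, `grad_wrt_encoder`) and hand-picked weights, and it does not reproduce the parameter sharing that keeps AINs small. It only shows that when the prediction for attribute $j$ depends on encoder $j$ alone, the loss for attribute $j$ yields a zero gradient for every other encoder.

```python
def encode(weights, x):
    # Toy linear encoder (assumption): h = w . x
    return sum(w * xi for w, xi in zip(weights, x))


def loss_for_attribute(params, x, y, j):
    # Squared error for attribute j; by construction it reads encoder j only.
    h_j = encode(params[j]["enc"], x)
    prediction = params[j]["head"] * h_j
    return (prediction - y[j]) ** 2


def grad_wrt_encoder(params, x, y, j, i, eps=1e-6):
    # Central finite-difference gradient of attribute j's loss
    # with respect to the weights of encoder i.
    grads = []
    for k in range(len(params[i]["enc"])):
        original = params[i]["enc"][k]
        params[i]["enc"][k] = original + eps
        loss_up = loss_for_attribute(params, x, y, j)
        params[i]["enc"][k] = original - eps
        loss_down = loss_for_attribute(params, x, y, j)
        params[i]["enc"][k] = original  # restore the weight
        grads.append((loss_up - loss_down) / (2 * eps))
    return grads


# Hypothetical weights, input, and targets chosen only for illustration.
params = {
    "shape": {"enc": [0.2, -0.1, 0.4, 0.3], "head": 1.5},
    "color": {"enc": [-0.3, 0.5, 0.1, -0.2], "head": 0.7},
}
x = [0.5, -1.0, 0.25, 2.0]
y = {"shape": 1.0, "color": 0.0}

grad_shape_to_shape = grad_wrt_encoder(params, x, y, "shape", "shape")
grad_shape_to_color = grad_wrt_encoder(params, x, y, "shape", "color")

print("shape loss -> shape encoder:", grad_shape_to_shape)
print("shape loss -> color encoder:", grad_shape_to_color)
```

In this toy setting the shape loss updates only the shape encoder. The zero gradient for the color encoder follows from the structure of the computation, not from the particular weights.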

Theorems & Definitions (7)

  • Definition 2.1: Compositional generalization
  • Example 2.2
  • Definition 3.1: Compositional Generalization (complete)
  • Definition 4.1: Attribute invariance
  • Theorem 4.2: Attribute invariances in gradient updates
  • Theorem G.1: Attribute invariances in gradient updates
  • proof
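
The informal notion behind Definition 2.1, predicting unknown combinations of known concepts, can be made concrete with a small check on a train/test split. This sketch is only an illustration of that informal statement from the abstract. It is not the paper's formal definition and not the orthotopic split procedure, and the helper names (`compositional_split`, `is_compositional`) and the two-attribute example are assumptions.

```python
from itertools import product


def compositional_split(attribute_values, held_out):
    # Train on every attribute combination except the held-out ones.
    all_combinations = set(product(*attribute_values))
    test = set(held_out)
    train = all_combinations - test
    return train, test


def is_compositional(train, test):
    # True when every test combination is unseen during training while
    # each of its individual attribute values does appear in training.
    if not test:
        return True
    n_attributes = len(next(iter(test)))
    seen = [{combo[i] for combo in train} for i in range(n_attributes)]
    combos_unseen = train.isdisjoint(test)
    values_known = all(
        combo[i] in seen[i] for combo in test for i in range(n_attributes)
    )
    return combos_unseen and values_known


shapes = ["cube", "sphere"]
colors = ["red", "blue"]

# Holding out one combination keeps "cube" and "blue" individually visible.
train, test = compositional_split([shapes, colors], [("cube", "blue")])
print(is_compositional(train, test))

# Holding out every cube removes the concept "cube" from training entirely.
train_bad, test_bad = compositional_split(
    [shapes, colors], [("cube", "red"), ("cube", "blue")]
)
print(is_compositional(train_bad, test_bad))
```

The second split fails the check because the model would have to predict an attribute value it has never observed. That is a different kind of generalization from recombining known values.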