Scalable Evaluation and Neural Models for Compositional Generalization
Giacomo Camposampiero, Pietro Barbiero, Michael Hersche, Roger Wattenhofer, Abbas Rahimi
TL;DR
This work tackles compositional generalization (CG) in vision by proposing a universal, scalable evaluation framework (orthotopic evaluation) that reduces the cost of evaluation from combinatorial to constant, and by introducing a compositional similarity index $c$ that defines a ladder of evaluation difficulty. It validates the framework with a large-scale benchmark of more than 5000 trained models spanning vision backbones and datasets. It also introduces Attribute Invariant Networks (AINs), which enforce attribute-invariant gradient updates and establish a new Pareto frontier between scalability and generalization with modest parameter overhead. In a comparison against explicitly disentangled (ED) models, the study shows that while ED models can reach high CG performance, AINs deliver competitive gains with far lower overhead, offering a practical path toward scalable compositional learning in supervised vision. The findings underscore the importance of accounting for CG difficulty via $c$, demonstrate the limits of standard architectures, and propose a concrete architectural pathway to improve compositional generalization in real-world settings. Code and resources are provided to support reproducibility and wider adoption of the proposed benchmarks and AINs.
Abstract
Compositional generalization, a key open challenge in modern machine learning, requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them with such biases compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation of the state of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts. Our code is available at https://github.com/IBM/scalable-compositional-generalization.
