COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization
Yassine Taoudi-Benchekroun, Klim Troyan, Pascal Sager, Stefan Gerber, Lukas Tuggener, Benjamin Grewe
TL;DR
COGITAO introduces a modular, procedural generator for studying visual compositional generalization that decouples perception from rule-based transformation reasoning. It offers 28 atomic object transformations with adjustable composition depth, configurable grids, and RGB or sequential renderings, which together yield millions of controllable input-output tasks for probing systematic generalization. Baseline experiments with diverse architectures show strong in-domain performance but poor out-of-distribution composition, and identify two failure modes: dependence on the transformation sequences seen during training (ID bias), and an inability to decompose and recompose atomic steps (structural composition failure). The framework thus provides a principled sandbox for developing architectures with genuinely compositional reasoning and transferable generalization, and can be extended toward world-model and real-world vision tasks.
Abstract
The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation of state-of-the-art machine learning models. To address this issue, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from ARC-AGI's problem setting, COGITAO constructs rule-based tasks that apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules (surpassing concurrent datasets by several orders of magnitude) across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments using state-of-the-art vision models, highlighting their consistent failure to generalize to novel combinations of familiar elements despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this field.
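
To make the idea of composing transformations at adjustable depth concrete, the following is a minimal illustrative sketch, not COGITAO's actual code or API. The three transformations, the `ATOMIC` registry, and the helpers `compose` and `count_sequences` are hypothetical names chosen for this example. As a simplification, each transformation acts on the whole grid rather than on individual objects.

```python
from functools import reduce

# A grid is a list of rows; 0 is background, nonzero values are object cells.
Grid = list[list[int]]


def translate_right(grid: Grid) -> Grid:
    """Shift every row one cell to the right, padding with background."""
    return [[0] + row[:-1] for row in grid]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror the grid left-to-right."""
    return [row[::-1] for row in grid]


def rotate_90(grid: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


# Hypothetical registry of atomic transformations (the framework has 28).
ATOMIC = {
    "translate_right": translate_right,
    "flip_horizontal": flip_horizontal,
    "rotate_90": rotate_90,
}


def compose(names: list[str]):
    """Build a task rule that applies the named transformations in order."""
    funcs = [ATOMIC[name] for name in names]

    def rule(grid: Grid) -> Grid:
        return reduce(lambda g, f: f(g), funcs, grid)

    return rule


def count_sequences(n_transforms: int, depth: int) -> int:
    """Count ordered sequences of the given depth, repetition allowed."""
    return n_transforms ** depth


grid = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
]

# Order matters: the same two atomic steps define two different rules.
shift_then_rotate = compose(["translate_right", "rotate_90"])(grid)
rotate_then_shift = compose(["rotate_90", "translate_right"])(grid)
```

Here `shift_then_rotate` leaves the object in the centre cell, while `rotate_then_shift` moves it to the top-right corner, so a model that has only seen one ordering during training faces a genuinely new rule when tested on the other. The raw count `count_sequences(28, 5)` is 17,210,368 ordered sequences; this is a plain combinatorial count under the assumption that every ordering is admissible, not the figure reported for the benchmark.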
