COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization

Yassine Taoudi-Benchekroun, Klim Troyan, Pascal Sager, Stefan Gerber, Lukas Tuggener, Benjamin Grewe

TL;DR

COGITAO introduces a modular, procedural generator for visual compositional generalization that decouples perception from rule-based transformation reasoning. By offering 28 atomic object-transformations with adjustable depth, grid configurations, and RGB/sequential renderings, it creates millions of controllable input-output tasks to probe systematic generalization. Baseline experiments with diverse architectures reveal that while in-domain performance is strong, out-of-distribution composition remains challenging, with two failure modes identified: dependence on training-time transformation sequences (ID bias) and inability to decompose and recompose atomic steps (structural composition failure). The framework thus provides a principled sandbox for developing architectures with truly compositional reasoning and transferable generalization, while enabling extensions toward world-model and real-world vision tasks.

Abstract

The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in state-of-the-art machine learning models. To address this issue, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from ARC-AGI's problem-setting, COGITAO constructs rule-based tasks which apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules -- surpassing concurrent datasets by several orders of magnitude -- across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments using state-of-the-art vision models, highlighting their consistent failures to generalize to novel combinations of familiar elements, despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this field.

Paper Structure

This paper contains 35 sections, 1 equation, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Set of input-output pair examples from our COGITAO generator, with input grids on the top rows and the corresponding output grids (after transforming the input) on the bottom rows. Each input-output pair follows a different transformation sequence (see boxes in the middle row) and a given grid/object parametrization. For a detailed overview of the transformations, we refer the reader to the appendix on COGITAO transformations.
  • Figure 2: COGITAO is a Python-based procedural and object-centric data generator, inspired by the ARC-AGI (Chollet) grid-based environment. COGITAO first samples a transformation sequence; then, given the configuration requested by the user and the sampled transformations, it randomly samples objects and positions them in an input grid. Once the objects are positioned in the input grid, each transformation is applied sequentially to each object.
  • Figure 3: Overview of Sequential-COGITAO. Episodes are generated from an initially sampled transformation sequence and grid parametrization. Each individual frame is saved, along with its corresponding transformations. Transitions for each individual object are also available, but are not shown here due to visualization constraints.
  • Figure 4: Set of input-output pair examples from our RGB rendering of COGITAO, with input images on the top rows and the corresponding output images (after transforming the input) on the bottom rows. The images are 128x128x3, saved as .jpeg. Gray borders are added for visualization purposes only; they are not present in the images. Note: objects are purposely blurry to emphasize their RGB nature; crisper renderings are straightforward to produce.
  • Figure 5: Random examples of generated objects. The objects are generated with a variety of properties, including size, symmetry, connectivity, colors, color patterns, and footprints. Note: objects are not allowed to overlap, touch, or be inside one another in our generator environment, as reflected in the above image.
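
The generation loop described in the Figure 2 caption (sample a transformation sequence, place objects in an input grid, then apply each transformation in order to every object) can be sketched roughly as follows. This is a minimal illustration and not COGITAO's actual API: the four transformations, the dictionary-based object representation, and every function name here are hypothetical stand-ins for the 28 transformations and the richer object sampler in the real generator. The sketch also only rejects overlapping or out-of-bounds objects, whereas the generator additionally forbids touching and nested objects.

```python
import random

# Hypothetical atomic transformations (illustrative names, not COGITAO's API).
# An object is a dict with a top-left position and a sprite of colors 1..9.
def flip_horizontal(obj):
    return {**obj, "sprite": [row[::-1] for row in obj["sprite"]]}

def flip_vertical(obj):
    return {**obj, "sprite": obj["sprite"][::-1]}

def translate_right(obj):
    return {**obj, "col": obj["col"] + 1}

def translate_down(obj):
    return {**obj, "row": obj["row"] + 1}

TRANSFORMS = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "translate_right": translate_right,
    "translate_down": translate_down,
}

def cells(obj):
    """Grid coordinates covered by the object's sprite."""
    return {(obj["row"] + r, obj["col"] + c)
            for r, row in enumerate(obj["sprite"])
            for c in range(len(row))}

def is_valid(objects, size):
    """True if every object is in bounds and no two objects overlap."""
    seen = set()
    for obj in objects:
        occupied = cells(obj)
        in_bounds = all(0 <= r < size and 0 <= c < size for r, c in occupied)
        if not in_bounds or seen & occupied:
            return False
        seen |= occupied
    return True

def random_object(rng, size):
    """Sample a small rectangular sprite at a random in-bounds position."""
    h, w = rng.randint(1, 3), rng.randint(1, 3)
    sprite = [[rng.randint(1, 9) for _ in range(w)] for _ in range(h)]
    return {"row": rng.randrange(size - h + 1),
            "col": rng.randrange(size - w + 1),
            "sprite": sprite}

def render(objects, size):
    """Draw the objects onto an empty (all-zero) grid."""
    grid = [[0] * size for _ in range(size)]
    for obj in objects:
        for r, row in enumerate(obj["sprite"]):
            for c, color in enumerate(row):
                grid[obj["row"] + r][obj["col"] + c] = color
    return grid

def generate_pair(seed, depth=2, n_objects=2, size=8, max_tries=1000):
    """Return (sequence, input_grid, output_grid) via rejection sampling."""
    rng = random.Random(seed)
    # 1) Sample the transformation sequence (composition depth = `depth`).
    sequence = [rng.choice(sorted(TRANSFORMS)) for _ in range(depth)]
    for _ in range(max_tries):
        # 2) Sample and position objects; retry if the placement is invalid.
        objects = [random_object(rng, size) for _ in range(n_objects)]
        if not is_valid(objects, size):
            continue
        # 3) Apply each transformation, in order, to every object.
        states = [objects]
        for name in sequence:
            states.append([TRANSFORMS[name](o) for o in states[-1]])
        # Keep the sample only if every intermediate grid stays valid.
        if all(is_valid(state, size) for state in states):
            return sequence, render(states[0], size), render(states[-1], size)
    raise RuntimeError("no valid placement found")
```

Calling `generate_pair(seed=0)` yields a sampled sequence of transformation names plus the input and output grids. Keeping the intermediate `states` list is also what a sequential variant in the spirit of Figure 3 would need, since each entry corresponds to one frame of an episode.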