Neurosymbolic Grounding for Compositional World Models

Atharva Sehgal, Arya Grayeli, Jennifer J. Sun, Swarat Chaudhuri

TL;DR

Cosmos tackles compositional generalization in world modeling by combining neural, object-centric representations with symbolically grounded attributes derived from vision-language foundation models. The framework uses a slot-based autoencoder for object extraction, a symbolic labeling module (SLM) to assign symbolic attribute vectors, and a modular transition model that binds rules to objects via a neurosymbolic attention mechanism, all trained end-to-end with differentiable components. On a 2D block-pushing domain with entity and relational composition, Cosmos achieves state-of-the-art next-state prediction and better downstream planning than fully neural and ablated baselines, while mitigating representation collapse. The work shows that vision-language symbols can guide modular dynamics without manual symbol engineering, and it points to richer symbolic reasoning and broader benchmarks as directions for future work.

Abstract

We introduce Cosmos, a framework for object-centric world modeling that is designed for compositional generalization (CompGen), i.e., high performance on unseen input scenes obtained through the composition of known visual "atoms." The central insight behind Cosmos is the use of a novel form of neurosymbolic grounding. Specifically, the framework introduces two new tools: (i) neurosymbolic scene encodings, which represent each entity in a scene using a real vector computed using a neural encoder, as well as a vector of composable symbols describing attributes of the entity, and (ii) a neurosymbolic attention mechanism that binds these entities to learned rules of interaction. Cosmos is end-to-end differentiable; also, unlike traditional neurosymbolic methods that require representations to be manually mapped to symbols, it computes an entity's symbolic attributes using vision-language foundation models. Through an evaluation that considers two different forms of CompGen on an established blocks-pushing domain, we show that the framework establishes a new state-of-the-art for CompGen in world modeling. Artifacts are available at: https://trishullab.github.io/cosmos-web/
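The two tools named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that an entity embedding and per-attribute text embeddings (e.g., from a CLIP-like model) are already available, that the symbol vector is a concatenation of per-attribute softmax distributions, and that a rule is bound to a slot by a hard argmax over dot-product scores. All function and parameter names (`symbol_vector`, `select_rule_and_slot`, `apply_rule`, `rule_keys`, `rule_weights`) are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def symbol_vector(entity_emb, attribute_vocab):
    """Concatenate one probability distribution per attribute.

    attribute_vocab maps an attribute name (assumed, e.g. "color") to a
    matrix of text embeddings with shape (n_values, d).
    """
    parts = []
    for name in sorted(attribute_vocab):
        text_embs = attribute_vocab[name]
        norms = np.linalg.norm(text_embs, axis=1) * np.linalg.norm(entity_emb)
        sims = text_embs @ entity_emb / (norms + 1e-8)  # cosine similarity
        parts.append(softmax(sims))
    return np.concatenate(parts)

def select_rule_and_slot(symbols, rule_keys):
    """Pick the (rule, slot) pair with the highest symbolic match score.

    symbols: (k, m) symbol vectors; rule_keys: (n_rules, m).
    Hard argmax is used for readability only; a differentiable model
    would need a soft or relaxed selection instead.
    """
    scores = rule_keys @ symbols.T  # (n_rules, k)
    rule_idx, slot_idx = np.unravel_index(np.argmax(scores), scores.shape)
    return int(rule_idx), int(slot_idx)

def apply_rule(slots, rule_weights, rule_idx, slot_idx):
    """Update only the selected neural slot with the selected rule."""
    updated = slots.copy()
    delta = np.tanh(rule_weights[rule_idx] @ slots[slot_idx])
    updated[slot_idx] = slots[slot_idx] + delta
    return updated
```

In this sketch the selection depends only on the symbol vectors, while the update acts on the neural slot vectors. That separation is the point of the illustration: a new combination of known attributes produces a familiar symbol vector, so the same rule can still be matched.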

Paper Structure (22 sections, 2 equations, 7 figures, 3 tables, 2 algorithms)

Figures (7)

  • Figure 1: Overview of compositional world modeling. We depict examples from a 2D block-pushing domain consisting of shapes that interact, where we can generate samples of different shapes and interactions. We aim to learn a model that generalizes to compositions not seen during training, such as entity composition (top) and relational composition (bottom). Previous works (Goyal et al., 2021) focus on entity composition and struggle to generalize to harder compositional environments. Our approach, Cosmos, leverages object-centric, neurosymbolic scene encodings to compositionally generalize across settings containing different types of compositions.
  • Figure 2: Comparing world modeling frameworks between prior work (Goyal et al., 2021) and Cosmos. Both frameworks start with entity extraction to obtain neural object representations $\{S_1, \dots, S_k\}$ from the image (Section \ref{sec:entity_extraction}). While prior work uses this representation directly for the module selector, our work leverages a symbolic labeling module, which outputs a set of attributes $\Lambda$, to learn neurosymbolic representations (Section \ref{sec:slm}). We then perform action conditioning (Section \ref{sec:action_conditioning}) to keep track of corresponding actions, and update the state through a transition model (Section \ref{sec:transition_model}).
  • Figure 3: A single update step of Cosmos. The image $I$ is fed through a slot-based autoencoder and a CLIP model to generate the slot encodings $\{S_1, \dots, S_k\}$ and the symbol vectors $\{\Lambda_1, \dots, \Lambda_k\}$. The actions and the symbolic encodings are aligned and concatenated using a permutation-equivariant action attention module, and the result is used to select the update rule to be applied to the slots. This figure depicts a single update step; in implementation, the update-select-transform step is repeated multiple times to model multi-object interactions.
  • Figure 4: Downstream utility of different world models using a greedy planner. The graph plots the average L1 error between the chosen next state and the ground-truth next state as a function of the number of steps the model takes. A lower L1 error indicates better performance. Cosmos (in red) achieves the best performance.
  • Figure 5: Overview of types of compositions studied. Entity composition (left) necessitates learning a world model that is equivariant to object replacement. Relational composition (right) necessitates learning the properties of entity composition as well as additional constraints where objects with shared attributes also have shared dynamics. We study two instantiations of shared attribute sets: "Sticky" and "Team". Details on these instantiations are given in Appendix \ref{sec:types-of-compositions}.
  • ...and 2 more figures
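
The evaluation described in the Figure 4 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's planner: it assumes the greedy planner picks the action whose predicted next state is closest in L1 distance to a goal state, that L1 distance is a sum of absolute differences, and that the model rollout and the true environment are advanced in parallel with the same chosen actions. The names `greedy_step`, `rollout_errors`, `model`, and `true_env` are hypothetical.

```python
import numpy as np

def l1(a, b):
    """L1 distance between two state vectors (sum of absolute differences)."""
    return float(np.abs(np.asarray(a) - np.asarray(b)).sum())

def greedy_step(state, goal, actions, model):
    """Return the index of the action whose predicted next state is
    closest to the goal, together with that predicted state."""
    preds = [model(state, a) for a in actions]
    best = min(range(len(actions)), key=lambda i: l1(preds[i], goal))
    return best, preds[best]

def rollout_errors(start, goal, actions, model, true_env, steps):
    """Roll the world model and the true environment forward with the
    same greedily chosen actions; record the L1 gap after each step."""
    model_state = np.asarray(start, dtype=float)
    true_state = np.asarray(start, dtype=float)
    errors = []
    for _ in range(steps):
        idx, model_state = greedy_step(model_state, goal, actions, model)
        true_state = true_env(true_state, actions[idx])
        errors.append(l1(model_state, true_state))
    return errors
```

With a toy environment where an action is a displacement (`true_env = lambda s, a: s + a`), a perfect model yields zero error at every step, while a model with a constant per-step bias yields an error that grows with the number of steps. Growth of this kind is what a curve of L1 error against step count makes visible.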