Neurosymbolic Grounding for Compositional World Models
Atharva Sehgal, Arya Grayeli, Jennifer J. Sun, Swarat Chaudhuri
TL;DR
Cosmos tackles compositional generalization in world modeling by combining neural, object-centric representations with symbolically grounded attributes derived from vision-language foundation models. The framework uses a slot-based autoencoder for object extraction, an SLM module to assign symbolic attribute vectors, and a modular transition model that binds rules to objects via a neurosymbolic attention mechanism, enabling end-to-end differentiable learning. On a 2D block-pushing domain with entity and relational composition, Cosmos achieves state-of-the-art next-state prediction and superior downstream planning relative to fully neural and ablated baselines, while mitigating representation collapse. This work demonstrates that vision-language symbols can guide modular dynamics without manual symbol engineering, and it points to richer symbolic reasoning and broader benchmarks as directions for future work.
Abstract
We introduce Cosmos, a framework for object-centric world modeling that is designed for compositional generalization (CompGen), i.e., high performance on unseen input scenes obtained through the composition of known visual "atoms." The central insight behind Cosmos is the use of a novel form of neurosymbolic grounding. Specifically, the framework introduces two new tools: (i) neurosymbolic scene encodings, which represent each entity in a scene using a real vector computed using a neural encoder, as well as a vector of composable symbols describing attributes of the entity, and (ii) a neurosymbolic attention mechanism that binds these entities to learned rules of interaction. Cosmos is end-to-end differentiable; also, unlike traditional neurosymbolic methods that require representations to be manually mapped to symbols, it computes an entity's symbolic attributes using vision-language foundation models. Through an evaluation that considers two different forms of CompGen on an established blocks-pushing domain, we show that the framework establishes a new state-of-the-art for CompGen in world modeling. Artifacts are available at: https://trishullab.github.io/cosmos-web/
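To make the two tools in the abstract concrete, the following is a minimal illustrative sketch, not the Cosmos implementation. It assumes each entity carries a neural vector plus a symbolic attribute vector (here, concatenated one-hot attributes), and that rule selection is computed from the symbolic vector alone through softmax attention over per-rule keys. The names `bind_rules` and `apply_rules`, the attribute layout, the key values, and the two toy rules are all hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bind_rules(symbolic_attrs, rule_keys):
    """Attention weights over rules for each entity.

    Assumption: scores are dot products between an entity's symbolic
    attribute vector and a key vector per rule (hypothetical design).
    """
    weights = []
    for attrs in symbolic_attrs:
        scores = [sum(a * k for a, k in zip(attrs, key)) for key in rule_keys]
        weights.append(softmax(scores))
    return weights

def apply_rules(neural_vecs, weights, rule_fns):
    """Update each neural vector with an attention-weighted mix of rule outputs."""
    next_vecs = []
    for z, w in zip(neural_vecs, weights):
        updates = [fn(z) for fn in rule_fns]
        mixed = [sum(wi * u[d] for wi, u in zip(w, updates)) for d in range(len(z))]
        next_vecs.append(mixed)
    return next_vecs

# Hypothetical attribute layout: [color=red, color=blue, shape=square, shape=circle]
symbolic_attrs = [[1, 0, 1, 0],   # entity 0: red square
                  [0, 1, 0, 1]]   # entity 1: blue circle
neural_vecs = [[0.0, 0.0], [1.0, 1.0]]  # stand-ins for encoder outputs

# Hand-set keys standing in for learned parameters.
rule_keys = [[5, 0, 0, 0],   # rule 0 attends to "red"
             [0, 5, 0, 0]]   # rule 1 attends to "blue"
rule_fns = [lambda z: [z[0] + 1.0, z[1]],   # toy rule 0: shift right
            lambda z: [z[0] - 1.0, z[1]]]   # toy rule 1: shift left

weights = bind_rules(symbolic_attrs, rule_keys)
next_vecs = apply_rules(neural_vecs, weights, rule_fns)
```

Because rule binding in this sketch depends only on the composable symbolic attributes, an unseen combination such as a red circle would still be routed to the "red" rule, which is the intuition behind using symbols for compositional generalization. Every operation is a dot product, softmax, or weighted sum, so the same structure stays differentiable when written in an autodiff framework.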
