Explicitly Disentangled Representations in Object-Centric Learning
Riccardo Majellaro, Jonathan Collu, Aske Plaat, Thomas M. Moerland
TL;DR
This work addresses the challenge of learning robust, structured representations for multi-object scenes by introducing DISA, an object-centric model that explicitly disentangles texture and shape information into two pre-defined latent subspaces in addition to position and scale. Building on Invariant Slot Attention, DISA uses two encoders (one operating on the raw image and one on Sobel-filtered input) and two decoders (mask and texture) to separate shape from texture, aided by a simple latent-space variance regularizer. Empirically, DISA improves reconstruction quality and achieves clear disentanglement on several synthetic benchmarks, while enabling texture transfer and generative sampling that alter appearance without changing geometry. These results advance interpretability and compositional generalization in non-probabilistic object-centric learning, with practical implications for controllable scene generation and robust downstream reasoning.
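As a rough illustration of the Sobel-filtered input mentioned above, the following is a minimal pure-Python sketch of a Sobel gradient-magnitude filter on a grayscale image. The function name `sobel_magnitude`, the use of gradient magnitude (rather than separate horizontal and vertical channels), and the border handling are assumptions made for illustration; the summary does not specify how DISA applies the filter.

```python
def sobel_magnitude(img):
    """Sobel gradient magnitude of a 2D grayscale image given as a list of rows.

    Only the interior ("valid") region is computed, so the output is
    (H - 2) x (W - 2). This is an illustrative sketch, not DISA's pipeline.
    """
    # Standard 3x3 Sobel kernels for horizontal and vertical gradients.
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    h, w = len(img), len(img[0])
    out = []
    for i in range(1, h - 1):
        row = []
        for j in range(1, w - 1):
            gx = sum(kx[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            gy = sum(ky[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            row.append((gx * gx + gy * gy) ** 0.5)
        out.append(row)
    return out


# A flat image has no edges; a vertical step edge gives a strong response.
flat = [[0.5] * 4 for _ in range(4)]
edge = [[0, 0, 1, 1] for _ in range(4)]
flat_out = sobel_magnitude(flat)
edge_out = sobel_magnitude(edge)
```

Because the filter responds to intensity changes rather than absolute color, an encoder fed this input sees mostly contours, which is consistent with the stated goal of separating shape from texture.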
Abstract
Extracting structured representations from raw visual data is an important and long-standing challenge in machine learning. Recently, techniques for unsupervised learning of object-centric representations have attracted growing interest. In this context, making the latent features more robust can improve the efficiency and effectiveness of training on downstream tasks. A promising step in this direction is to disentangle the factors that cause variation in the data. Previously, Invariant Slot Attention disentangled position, scale, and orientation from the remaining features. Extending this approach, we focus on separating the shape and texture components. In particular, we propose a novel architecture that biases object-centric models toward disentangling shape and texture components into two non-overlapping subsets of the latent space dimensions. These subsets are fixed a priori, that is, before training begins. Experiments on a range of object-centric benchmarks show that our approach achieves the desired disentanglement while also improving on baseline performance numerically in most cases. In addition, we show that our method can generate novel textures for a specific object or transfer textures between objects with distinct shapes.
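Because the shape and texture subsets are non-overlapping and known before training, texture transfer can be pictured as swapping one subset of latent dimensions between two object slots and then decoding. The sketch below shows only the swap. The function name `swap_texture`, the flat-list slot layout, and the choice of which indices hold texture are hypothetical and are not taken from the paper; decoding is not shown.

```python
def swap_texture(slot_a, slot_b, texture_dims):
    """Exchange the texture dimensions of two slot vectors.

    slot_a, slot_b: equal-length lists of floats (hypothetical slot latents).
    texture_dims: indices assumed to hold texture information; all other
    indices (e.g. shape) are left untouched.
    Returns two new lists; the inputs are not modified.
    """
    new_a, new_b = list(slot_a), list(slot_b)
    for d in texture_dims:
        new_a[d], new_b[d] = slot_b[d], slot_a[d]
    return new_a, new_b


# Hypothetical layout: indices 0-1 hold shape, indices 2-3 hold texture.
slot_a = [1.0, 2.0, 3.0, 4.0]
slot_b = [5.0, 6.0, 7.0, 8.0]
new_a, new_b = swap_texture(slot_a, slot_b, texture_dims=[2, 3])
```

In this picture, each swapped slot keeps its own shape dimensions and carries the other object's texture dimensions, which matches the claim that appearance can change without changing geometry.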
