
Explicitly Disentangled Representations in Object-Centric Learning

Riccardo Majellaro, Jonathan Collu, Aske Plaat, Thomas M. Moerland

TL;DR

This work addresses the challenge of learning robust, structured representations for multi-object scenes by introducing DISA, an object-centric model that explicitly disentangles texture and shape information into two pre-defined latent subspaces in addition to position and scale. Building on Invariant Slot Attention, DISA uses two encoders (one operating on the raw image and one on Sobel-filtered input) and two decoders (mask and texture) to separate shape from texture, aided by a simple latent-space variance regularizer. Empirically, DISA improves reconstruction quality and achieves clear disentanglement on several synthetic benchmarks, while enabling texture transfer and generative sampling that alter appearance without changing geometry. These results advance interpretability and compositional generalization in non-probabilistic object-centric learning, with practical implications for controllable scene generation and robust downstream reasoning.
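To make the role of the Sobel-filtered input concrete, the following is a minimal sketch of a Sobel gradient-magnitude filter in plain Python. It is not the authors' preprocessing code: the function name `sobel_magnitude`, the single-channel list-of-rows image format, and the zeroed border are assumptions made for illustration. It only shows why such a filter keeps edges (shape cues) while suppressing flat regions.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude of a 2D grayscale image (list of rows).

    Assumption of this sketch: border pixels are left at 0.0 instead of
    being padded.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


# A uniformly colored region gives no response; a vertical edge does.
flat = [[5.0] * 4 for _ in range(4)]
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
flat_response = sobel_magnitude(flat)
edge_response = sobel_magnitude(edge)
```

In this toy case the uniform patch maps to zero everywhere, while the interior pixels next to the edge get a nonzero magnitude, which is the sense in which the filter partly removes texture information.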

Abstract

Extracting structured representations from raw visual data is an important and long-standing challenge in machine learning. Recently, techniques for unsupervised learning of object-centric representations have attracted growing interest. In this context, enhancing the robustness of the latent features can improve the efficiency and effectiveness of training on downstream tasks. A promising step in this direction is to disentangle the factors that cause variation in the data. Previously, Invariant Slot Attention disentangled position, scale, and orientation from the remaining features. Extending this approach, we focus on separating the shape and texture components. In particular, we propose a novel architecture that biases object-centric models toward disentangling shape and texture components into two non-overlapping subsets of the latent space dimensions. These subsets are fixed a priori, that is, before training begins. Experiments on a range of object-centric benchmarks reveal that our approach achieves the desired disentanglement while also numerically improving baseline performance in most cases. In addition, we show that our method can generate novel textures for a specific object or transfer textures between objects with distinct shapes.
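The idea of two non-overlapping, pre-defined subsets of latent dimensions, and the texture transfer it enables, can be sketched as follows. This is an illustrative simplification, not the model's actual data layout: the flat per-object vector, the sizes `SHAPE_DIM` and `TEXTURE_DIM`, and the helper names `split_slot` and `swap_textures` are all hypothetical.

```python
# Hypothetical layout, fixed before training: the first SHAPE_DIM entries of
# an object representation hold shape, the remaining TEXTURE_DIM hold texture.
SHAPE_DIM = 3
TEXTURE_DIM = 2


def split_slot(slot):
    """Split a flat object vector into its (shape, texture) parts."""
    if len(slot) != SHAPE_DIM + TEXTURE_DIM:
        raise ValueError("unexpected slot length")
    return slot[:SHAPE_DIM], slot[SHAPE_DIM:]


def swap_textures(slot_a, slot_b):
    """Exchange texture parts between two objects, keeping their shapes."""
    shape_a, texture_a = split_slot(slot_a)
    shape_b, texture_b = split_slot(slot_b)
    return shape_a + texture_b, shape_b + texture_a


# Two toy objects: distinct shape codes and distinct texture codes.
obj_a = [0.1, 0.2, 0.3, 9.0, 9.5]
obj_b = [0.7, 0.8, 0.9, 1.0, 1.5]
new_a, new_b = swap_textures(obj_a, obj_b)
```

Because the index ranges are known in advance, the swap needs no learned mapping; whether decoding the swapped vectors yields the expected images depends on how well the trained model actually confines each factor to its subset.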

Paper Structure (34 sections, 9 equations, 37 figures, 2 tables)

Figures (37)

  • Figure 1: Architecture of DISA. The input image (top-left) is first fed through a Sobel filter to partly remove texture information (bottom-left). This filtered image is then encoded and passed through an Invariant Slot Attention (Biza et al., 2023) module to extract object-centric shape ($\text{Shape}_i$), position ($\text{P}_i$), and scale ($\text{S}_i$) vectors (bottom-middle). This information is used to decode a mask for each object (bottom-right). The initial input image is encoded (top-left) and then combined with the object attention masks from the ISA module to produce a texture vector ($\text{Texture}_i$) per object (top-middle). These texture representations are combined with their associated shape, position, and scale to decode the textures of the objects (top-right). The shape, position, and scale information is needed here because the decoded textures must fit the already predicted masks. Finally, the decoded masks and textures are combined into an image reconstruction (right).
  • Figure 2: Illustration of the two property prediction tasks that we propose to quantitatively study the degree of texture and shape disentanglement in the representations of DISA. (a) Prediction of an object property from the associated components, for instance, the color from the texture-related latent features. (b) Inverse property prediction task, where properties are predicted from the "wrong" part of the object representation, e.g., the color from the shape-related components.
  • Figure 3: Quantitative results of DISA on the regular and inverse property prediction tasks. The predicted properties are shape, color, and material, which are all categorical variables except for the color on Multi-dSprites. For categorical variables, we report the prediction accuracy and compare it with a random-guess baseline. For the numerical variable, we report the R$^2$ score. All values are the mean and standard deviation over 3 seeds. On Tetrominoes and Multi-dSprites, DISA correctly encodes texture and shape information into two non-overlapping subsets of its latent space dimensions. On CLEVR6 and CLEVRTex, part of the shape and texture information leaks into incorrect components.
  • Figure 4: Qualitative results showing the position and scale disentanglement in DISA. Given an input image (first column), we select two objects and modify their scale/position factors incrementally over 4 steps (second to fifth columns). Texture and shape are consistent across the steps, while position and scale can be varied independently and correctly, suggesting that the model confines position and scale information to their respective components. The visible limitations are shared by ISA, e.g., parts of an object go missing after another object that covered it is moved, and the masks become inconsistent when the scale is increased beyond a certain point. Additionally, on Tetrominoes, scaling an object does not yield completely wrong results, which is not trivial as the model only sees shapes of fixed size during training.
  • Figure 5: Compositional capabilities of DISA on (a) Tetrominoes and (b) CLEVR6. The central row contains the input image and the predicted object masks. The top row shows the reconstructed image and object textures (empty slots excluded), while the bottom row presents the reconstruction and object textures after interchanging texture vectors between objects. DISA shows strong compositional generalization, enabling reliable transfer of textures between objects while preserving shape, position, and scale information. Moreover, mask predictions are highly accurate with clear background separation.
  • ...and 32 more figures
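
The two metrics named in the Figure 3 caption (accuracy against a random-guess baseline for categorical properties, and the R$^2$ score for the numerical one) can be sketched as below. This is a generic illustration rather than the paper's evaluation code: the helper names are hypothetical, and the baseline is assumed to be a uniform guess over the classes, which the caption does not specify.

```python
def accuracy(predicted, target):
    """Fraction of categorical predictions that match the target labels."""
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)


def chance_accuracy(num_classes):
    """Expected accuracy of a uniform random guess (assumed baseline)."""
    return 1.0 / num_classes


def r2_score(predicted, target):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined when all targets are equal (SS_tot would be zero).
    """
    mean_target = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for p, t in zip(predicted, target))
    ss_tot = sum((t - mean_target) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


# Toy categorical property (e.g. a color label) with 4 samples.
acc = accuracy(["red", "blue", "red", "green"],
               ["red", "blue", "green", "green"])

# Toy numerical property: a perfect predictor and a mean-only predictor.
r2_perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
r2_mean_only = r2_score([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])
```

Under this reading, good disentanglement shows up as high accuracy or R$^2$ on the regular task and values near the chance level (or near zero for R$^2$) on the inverse task.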