Leveraging Color Channel Independence for Improved Unsupervised Object Detection

Bastian Jäckl, Yannick Metz, Udo Schlegel, Daniel A. Keim, Maximilian T. Fischer

TL;DR

Unsupervised object-centric learning with Slot Attention is hindered by RGB's high channel correlations and lighting sensitivity. The authors introduce composite color spaces that augment the target outputs with additional channels (notably Saturation, forming RGB-S and RGB-SV) while keeping inputs and architectures unchanged. Across five multi-object datasets, these composite targets consistently boost object discovery and factor representation, with RGB-S delivering substantial gains (e.g., strong improvements on Clevrtex) and broad applicability to photorealistic data like Movi-C. The approach requires negligible compute and opens avenues for applying composite color spaces to broader visual learning tasks beyond OCRL, highlighting color representation as a tractable lever for representation quality.

Abstract

Object-centric architectures can learn to extract distinct object representations from visual scenes, enabling downstream applications on the object level. Like autoencoder-based image models, object-centric approaches have been trained on the unsupervised reconstruction loss of images encoded in RGB color spaces. In our work, we challenge the common assumption that RGB is the optimal color space for unsupervised learning in computer vision. We argue conceptually and show empirically that other color spaces, such as HSV, bear essential characteristics for object-centric representation learning, like robustness to lighting conditions. We further show that models improve when they are required to predict additional color channels. Specifically, we propose to transform the prediction targets to the RGB-S space, which extends RGB with HSV's saturation component and leads to markedly better reconstruction and disentanglement for five common evaluation datasets. Composite color spaces can be implemented with virtually no computational overhead, are agnostic to the model architecture, and are applicable across a wide range of visual computing tasks and training types. Our findings encourage additional investigations in computer vision tasks beyond object-centric learning.
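To make the RGB-S target concrete, the following is a minimal sketch (not the authors' implementation) of how an RGB image could be extended with HSV's saturation channel and used as a reconstruction target. The function names `saturation`, `to_rgbs`, and `reconstruction_loss`, the value range [0, 1], and the choice of a mean-squared-error loss are assumptions made for illustration; only the standard HSV saturation formula, S = (max - min) / max, is taken as given.

```python
import numpy as np


def saturation(rgb):
    """HSV saturation for an array of shape (..., 3) with values in [0, 1]."""
    cmax = rgb.max(axis=-1)
    cmin = rgb.min(axis=-1)
    # Avoid division by zero for black pixels; their saturation is defined as 0.
    safe_max = np.where(cmax > 0, cmax, 1.0)
    return np.where(cmax > 0, (cmax - cmin) / safe_max, 0.0)


def to_rgbs(rgb):
    """Append the saturation channel to an RGB image: (..., 3) -> (..., 4)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    s = saturation(rgb)[..., None]
    return np.concatenate([rgb, s], axis=-1)


def reconstruction_loss(pred_rgbs, image_rgb):
    """Assumed MSE between a 4-channel prediction and the RGB-S target.

    The model input stays RGB; only the target gains a channel.
    """
    target = to_rgbs(image_rgb)
    return float(np.mean((np.asarray(pred_rgbs) - target) ** 2))


# Usage: a 2x2 image with red, gray, black, and a blue-ish pixel.
image = np.array([
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5]],
    [[0.0, 0.0, 0.0], [0.2, 0.4, 0.8]],
])
target = to_rgbs(image)
print(target.shape)  # four channels per pixel: R, G, B, S
```

Under this sketch, the only architectural change would be a decoder that outputs four channels instead of three, which is consistent with the abstract's claim of negligible overhead.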

Paper Structure

This paper contains 42 sections, 16 figures, 12 tables.

Figures (16)

  • Figure 1: Exemplary scenes from the Clevrtex and Movi-C datasets. We plot the original image and object masks, with individual color channels shown as heatmaps. We show RGB's Red, Green, and Blue channels and HSV's Hue, Saturation, and Value. The objects are distinguishable from the background and from other objects in all heatmaps. In contrast to the RGB channels, the HSV channels are uncorrelated.
  • Figure 2: Two qualitative samples for SA with an RGB-HSV reconstruction on Clevr. The reconstruction and slot assignment are degenerate, with scenes split into many stripe-like slots. For models trained on RGB, RGB-S, and RGB-SV, the mask reconstruction almost perfectly matches the ground-truth mask (for more details, see \ref{fig:clevr_scene}).
  • Figure 3: We show reconstructions and masks of models trained on Multishapenet, Clevrtex, and Movi-C compared to their ground truth. While models trained on RGB do not confidently segment objects from each other and from the background, RGB-S achieves close-to-perfect scene segmentations. On Movi-C, RGB models degenerate to representing spatial areas instead of objects. The RGB-S slots mostly attend to semantic objects, but objects are often split into multiple slots.
  • Figure 4: We report Slot Learning and Disentanglement results for the CNN and ResNet variants on the Clevrtex dataset. The left figure shows the average precision of a linear predictor trained to map from slot representations to underlying object properties. The middle figure shows results for a shallow non-linear predictor. The right figure shows the Informativeness, Disentanglement, and Completeness of slot representations (Eastwood and Williams, 2018). The composite color spaces consistently outperform the RGB space for all considered metrics. The HSV space even outperforms the composite color spaces. Although the composite color spaces significantly improve how well slots represent the underlying object factors, they still show low performance in disentanglement and completeness.
  • Figure 5: We show color statistics of our newly generated Clevrtex test set under different lighting conditions. The variable $L$ denotes the distance of the lights and is applied to all light sources. While the distributions of Hue and Saturation remain mostly consistent, all other color channels show large discrepancies.
  • ...and 11 more figures
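
The lighting robustness noted for Figure 5 can be illustrated with a small sketch. It assumes, as a strong simplification, that a lighting change acts as a uniform scaling of all three RGB values; real scenes with shadows, specular highlights, or colored lights will deviate from this. The helper name `hsv_under_dimming` and the example color are hypothetical; `colorsys.rgb_to_hsv` is from the Python standard library.

```python
import colorsys


def hsv_under_dimming(rgb, factor):
    """HSV of an RGB color after scaling all channels by `factor`.

    Assumption: a lighting change is modeled as uniform intensity scaling.
    """
    r, g, b = rgb
    return colorsys.rgb_to_hsv(r * factor, g * factor, b * factor)


color = (0.8, 0.4, 0.2)                   # hypothetical object color
bright = hsv_under_dimming(color, 1.0)    # original lighting
dim = hsv_under_dimming(color, 0.5)       # half the intensity

# Hue and saturation depend only on channel ratios, so they are unchanged;
# value and each of R, G, B scale with the intensity.
print("hue:       ", bright[0], dim[0])
print("saturation:", bright[1], dim[1])
print("value:     ", bright[2], dim[2])
```

Under this assumption, a target channel such as saturation carries object information that does not shift with brightness, whereas each RGB channel does.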