Self-Supervised Learning of Color Constancy

Markus R. Ernst; Francisco M. López; Arthur Aubret; Roland W. Fleming; Jochen Triesch

TL;DR

This work investigates how color constancy (CC) could develop via self-supervised learning that exploits temporal illumination changes. It introduces the Color Constancy Cubes (C3R) dataset and a time-contrastive learning framework (SimCLR-TT) to learn illumination-invariant representations, formalized by a contrastive loss $\mathcal{L}$ with $\tau=1$. After training, a linear probe on frozen features demonstrates CC by accurately predicting object color under varying lighting, outperforming a color-jitter baseline and revealing emergent color-based clustering in the learned latent space. The study suggests a plausible developmental mechanism for CC, highlights the role of temporal structure and context (e.g., ground plane), and discusses limitations and avenues for extending to more realistic scenes and joint encodings of color, shape, and viewpoint.

Abstract

Color constancy (CC) describes the ability of the visual system to perceive an object as having a relatively constant color despite changes in lighting conditions. While CC and its limitations have been carefully characterized in humans, it is still unclear how the visual system acquires this ability during development. Here, we present a first study showing that CC develops in a neural network trained in a self-supervised manner through an invariance learning objective. During learning, objects are presented under changing illuminations, while the network aims to map subsequent views of the same object onto close-by latent representations. This gives rise to representations that are largely invariant to the illumination conditions, offering a plausible example of how CC could emerge during human cognitive development via a form of self-supervised learning.

Paper Structure (2 sections, 1 equation, 5 figures)

Self-supervised training
Evaluation

Figures (5)

Figure 1: Overview. A. A central cube lies on a ground plane and is illuminated by up to 8 spotlights arranged on a circle. B. Colors of the cubes in our dataset in HSV space (50 objects, $S=0.5, V=1$). C. The seven different colors (red, green, blue, cyan, yellow, magenta, white) used for the spotlights.
Figure 2: The temporal structure of the C3R dataset. Each of the eight spotlights is randomly assigned a color and luminosity. The resulting illumination changes rapidly over time and gives different impressions of the colored (blue) cube in a sequence. The algorithm shapes the latent representations of impressions in temporal vicinity (square brackets) to be alike.
Figure 3: The self-supervised learning framework used for this study. Two successive images are encoded by a neural network $f(\cdot)$ to yield a hidden representation $h$ that is used for downstream classification tasks. The network is trained by projecting the hidden representation using a projection head $g(\cdot)$ and maximizing agreement in the resulting space $z$.
Figure 4: Overall results. A. Learning curve of our proposed method (solid) compared to a supervised baseline (dashed). Horizontal line (dotted) shows raw pixel accuracy. B. Downstream classification of object colors and lighting at different layers of the network hierarchy for our approach vs. object accuracy from pure color jittering (red). C. PaCMAP visualization of the representation $h$ evolving during training. D. Representation at different points of the network hierarchy after training has completed (epoch = 100). Error bars and envelopes in A and B depict standard deviation from five independent runs.
Figure 5: Linear weights visualized in RGB space for raw pixel classification after 200 epochs of training. Brighter means higher value. Colored underline depicts the true color of each object.
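
The objective outlined in the TL;DR and in Figure 3 (maximizing agreement between the projections $z$ of two successive images, with $\tau=1$) can be illustrated with a minimal NumPy sketch. This assumes the standard SimCLR-style normalized-temperature cross-entropy (NT-Xent) form; the paper's exact equation is not reproduced on this page, so the function name `nt_xent_loss`, the batch layout, and the use of cosine similarity are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def nt_xent_loss(z_t, z_next, tau=1.0):
    """SimCLR-style NT-Xent loss (assumed form, not taken from the paper).

    Row i of z_t and row i of z_next are projections of two temporally
    adjacent views of the same object and form the positive pair.
    """
    n = z_t.shape[0]
    z = np.concatenate([z_t, z_next], axis=0)         # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit norm: dot product = cosine
    sim = z @ z.T / tau                               # pairwise similarities, (2N, 2N)
    np.fill_diagonal(sim, -np.inf)                    # a sample is never its own negative

    # Index of the positive partner for every row: i <-> i + N.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])

    # Numerically stable log-softmax over each row, read out at the positive.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), pos] - log_denom
    return float(-log_prob.mean())

# Toy check with random stand-ins for the projections z (hypothetical data).
rng = np.random.default_rng(0)
z_t = rng.normal(size=(8, 16))
aligned = nt_xent_loss(z_t, z_t + 0.01 * rng.normal(size=z_t.shape))
unrelated = nt_xent_loss(z_t, rng.normal(size=z_t.shape))
print(aligned < unrelated)
```

In this sketch the loss is lower when successive views map to nearby points than when they are unrelated, which is the pressure that would push the encoder toward illumination-invariant representations. The temperature default of 1 mirrors the value quoted in the TL;DR.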
