Latent Representation Matters: Human-like Sketches in One-shot Drawing Tasks

Victor Boutin, Rishav Mukherji, Aditya Agrawal, Sabine Muzellec, Thomas Fel, Thomas Serre, Rufin VanRullen

TL;DR

LDMs with redundancy reduction and prototype-based regularizations produce near-human-like drawings, in terms of both recognizability and originality, and better mimic human perception, as evaluated psychophysically.

Abstract

Humans can effortlessly draw new categories from a single exemplar, a feat that has long posed a challenge for generative models. However, this gap has started to close with recent advances in diffusion models. This one-shot drawing task requires powerful inductive biases that have not been systematically investigated. Here, we study how different inductive biases shape the latent space of Latent Diffusion Models (LDMs). Along with standard LDM regularizers (KL and vector quantization), we explore supervised regularizations (including classification and prototype-based representation) and contrastive inductive biases (using SimCLR and redundancy reduction objectives). We demonstrate that LDMs with redundancy reduction and prototype-based regularizations produce near-human-like drawings (regarding both samples' recognizability and originality) -- better mimicking human perception (as evaluated psychophysically). Overall, our results suggest that the gap between humans and machines in one-shot drawings is almost closed.
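
To make the "redundancy reduction" objective mentioned above more concrete, here is a minimal NumPy sketch of a Barlow Twins-style loss on two batches of latent codes. It follows the commonly used formulation (drive the cross-correlation matrix of two views toward the identity) and is not taken from the paper's own equations; the function name `barlow_twins_loss`, the `lambda_offdiag` default, and the batch shapes are illustrative assumptions.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-8):
    """Redundancy-reduction loss between two views, each of shape (N, D).

    Hypothetical helper: the weighting and normalization are assumptions,
    not the paper's exact regularizer.
    """
    n = z_a.shape[0]
    # Standardize each latent dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (D, D).
    c = z_a.T @ z_b / n
    diag = np.diag(c)
    # Invariance term: matching dimensions should correlate perfectly.
    on_diag = np.sum((diag - 1.0) ** 2)
    # Redundancy term: distinct dimensions should be decorrelated.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)
    return on_diag + lambda_offdiag * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
loss_aligned = barlow_twins_loss(z, z)                      # two identical views
loss_shuffled = barlow_twins_loss(z, rng.permutation(z))    # views no longer paired
```

In this sketch, two identical views give a loss close to zero, while breaking the pairing between views raises the invariance term. In an LDM setting such a term would be added to the auto-encoder loss with a weight $\beta$, as the figure captions below describe.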

Paper Structure (49 sections, 33 equations, 24 figures, 2 tables, 4 algorithms)

Figures (24)

  • Figure 1: Latent Diffusion Models stack a diffusion model (orange) on top of an Auto-Encoder (green).
  • Figure 2: Samples from LDMs with different regularizers. The LDMs correspond to the larger data points in Fig. \ref{fig:fig1}.
  • Figure 3: Effect of increasing the regularization weights on the originality vs. recognizability framework (QuickDraw-FS dataset). Each data point represents an LDM trained with different values of regularization weights ($\beta$). The curves represent the parametric fits, oriented in the direction of an increase of $\beta$. a) For the LDMs with "standard" regularizers, $\beta$ is applied to the KL ($\mathcal{L}_{KL}$ in Eq. \ref{eq:l_reg_kl}) or to the VQ regularizers ($\mathcal{L}_{VQ}$ in Eq. \ref{eq:l_reg_vq}). b) For the supervised regularizers, $\beta$ is applied to the CL ($\mathcal{L}_{CL}$ in Eq. \ref{eq:l_reg_cls}) or to the prototype-based regularizers ($\mathcal{L}_{PR}$ in Eq. \ref{eq:l_reg_proto}). c) For the contrastive regularizers, $\beta$ is applied to the SimCLR ($\mathcal{L}_{SimCLR}$ in Eq. \ref{eq:app_InfoNCE}) or to the Barlow regularizers ($\mathcal{L}_{Bar}$ in Eq. \ref{eq:app_bar}). See \ref{app:reg_impact} for more information on the range of $\beta$ explored for each regularizer. Larger data points indicate models whose performance is closer to that of humans for each type of regularization. For comparison, we include an LDM leveraging a non-regularized RAE (hexagon marker) and a diffusion model trained directly on the pixel space (cross marker). The human performance corresponds to the recognizability and originality computed on human drawings (shown with a grey star).
  • Figure 4: Combined effect of the regularization weights on the originality vs. recognizability framework (QuickDraw-FS dataset). Each data point represents an LDM trained with a combination of $2$ different regularizers. All combinations include the prototype-based regularizer. The curves represent the parametric fits, oriented in the direction of an increase of $\beta$. In every combination, only the weight of the prototype-based regularizer is varied while the weight of the other regularizer is held fixed. a) Barlow and prototype-based regularizers, applied either separately (solid lines) or in combination (dashed line), with the Barlow weight set to $\beta=30$. b) SimCLR and prototype-based regularizers, with the SimCLR weight set to $\beta=1$. c) KL and prototype-based regularizers, with the KL weight set to $\beta=10^{-3}$. d) VQ and prototype-based regularizers, with the VQ weight set to $\beta=20$. See caption in Fig. \ref{fig:fig1}.
  • Figure 5: Feature importance maps comparison. a) The visualizations include feature importance maps for humans (top row) and LDMs (six bottom rows). All the maps are overlaid on exemplars. Hot vs. cold pixels show image locations that are more vs. less important. Maps for humans were computed using psychophysical data from \cite{boutin2023diffusion}. For the LDMs, they are obtained for each category by averaging $\phi(\mathbf{x},\mathbf{y})$ (see Eq. \ref{eq:main:ldm_attrib}) over $10$ different image variations ($\mathbf{x}$) belonging to the same category. The models' maps are computed on the most human-like LDM for each regularization (larger data points in Fig. \ref{fig:fig2}). b) Spearman's rank correlation coefficient between human and LDM feature importance maps. The error bar is computed as the standard deviation of the Spearman coefficients over all categories ($25$ in total). Stars indicate the p-value ($\star\!\star\!\star: p<10^{-3}$ and $\star: p<5\times10^{-2}$) of pairwise statistical tests between models (Wilcoxon signed-rank test, see \ref{App:LDM_wixcox}). The black line corresponds to an LDM without any regularization. The dashed line is the human consistency ($0.88$), which quantifies how much two populations of humans agree with each other on feature importance maps (see \ref{App:CLickMe_Viz} for details on the human consistency computation).
  • ...and 19 more figures