Disentangling Visual Priors: Unsupervised Learning of Scene Interpretations with Compositional Autoencoder

Krzysztof Krawiec, Antoni Nowinowski

TL;DR

The paper addresses the problem of obtaining principled, high-level scene interpretations from visual data. It introduces Disentangling Visual Priors (DVP), a neurosymbolic framework that uses a domain-specific language (DSL) to encode priors on object shape, appearance, categorization, and geometric transforms. A Perception module maps an image to a latent vector $z$, which parameterizes a DSL program that generates a symbolic Scene; a differentiable Renderer rasterizes the Scene, and the rendering is compared with the input, enabling end-to-end training as a compositional autoencoder. DVP demonstrates disentanglement across shape, color, pose, transform, and category, learns from small data, and generalizes to unseen shapes, with learnable shape prototypes and Elliptic Fourier Descriptors providing interpretable priors. The approach yields an explainable, modular image-formation pipeline and shows robustness to noise, offering a path toward scalable out-of-distribution generalization in scene understanding.
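The role of Elliptic Fourier Descriptors as a shape prior can be illustrated with a minimal sketch. It uses the standard EFD contour formula, in which each harmonic $n$ contributes four coefficients $(a_n, b_n, c_n, d_n)$; the function name `efd_contour` and its parameters are hypothetical and are not taken from the paper, whose exact parameterization is not given here.

```python
import math
from typing import List, Tuple

# One harmonic of an elliptic Fourier descriptor: (a_n, b_n, c_n, d_n).
Harmonic = Tuple[float, float, float, float]


def efd_contour(
    harmonics: List[Harmonic],
    center: Tuple[float, float] = (0.0, 0.0),
    num_points: int = 64,
) -> List[Tuple[float, float]]:
    """Sample a closed contour from elliptic Fourier coefficients.

    x(t) = x0 + sum_n a_n cos(n t) + b_n sin(n t)
    y(t) = y0 + sum_n c_n cos(n t) + d_n sin(n t),  t in [0, 2*pi)
    """
    points = []
    for k in range(num_points):
        t = 2.0 * math.pi * k / num_points
        x, y = center
        for n, (a, b, c, d) in enumerate(harmonics, start=1):
            x += a * math.cos(n * t) + b * math.sin(n * t)
            y += c * math.cos(n * t) + d * math.sin(n * t)
        points.append((x, y))
    return points


# A single harmonic with a = d = r and b = c = 0 traces a circle of radius r.
circle = efd_contour([(2.0, 0.0, 0.0, 2.0)])
```

Adding higher harmonics deforms the circle into more complex closed shapes, so a short coefficient vector can act as a compact, interpretable shape code.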

Abstract

Contemporary deep learning architectures lack principled means for capturing and handling fundamental visual concepts, like objects, shapes, geometric transforms, and other higher-level structures. We propose a neurosymbolic architecture that uses a domain-specific language to capture selected priors of image formation, including object shape, appearance, categorization, and geometric transforms. We express template programs in that language and learn their parameterization with features extracted from the scene by a convolutional neural network. When executed, the parameterized program produces geometric primitives, which are rendered and assessed for correspondence with the scene content; the model is trained via auto-association using gradients. We compare our approach with a baseline method on a synthetic benchmark and demonstrate its capacity to disentangle selected aspects of the image formation process, learn from small data, infer correctly in the presence of noise, and generalize out of sample.
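The render-and-compare step described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Circle` primitive, the `program` and `render` functions, the sigmoid-based soft rasterizer, and the MSE loss are all hypothetical stand-ins, and the convolutional Perception module that would produce $z$ is omitted.

```python
import math
from typing import List, NamedTuple

Image = List[List[float]]


class Circle(NamedTuple):
    """Hypothetical geometric primitive standing in for a Scene."""
    cx: float
    cy: float
    radius: float
    intensity: float


def program(z: List[float]) -> Circle:
    """Hypothetical template program: read the latent z as circle parameters."""
    cx, cy, radius, intensity = z
    return Circle(cx, cy, radius, intensity)


def render(scene: Circle, size: int = 16, sharpness: float = 4.0) -> Image:
    """Soft rasterizer: a sigmoid of the signed distance to the circle edge,
    so pixel values vary smoothly with the scene parameters."""
    image = []
    for row in range(size):
        line = []
        for col in range(size):
            dist = math.hypot(col - scene.cx, row - scene.cy)
            coverage = 1.0 / (1.0 + math.exp(sharpness * (dist - scene.radius)))
            line.append(scene.intensity * coverage)
        image.append(line)
    return image


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two images of equal size."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n


# The reconstruction error is zero for the true parameters and positive otherwise.
target = render(program([8.0, 8.0, 4.0, 1.0]))
good = mse(render(program([8.0, 8.0, 4.0, 1.0])), target)
bad = mse(render(program([3.0, 3.0, 2.0, 1.0])), target)
```

Because the rendering is smooth in the scene parameters, an error of this kind can in principle be differentiated with respect to $z$, which is what makes auto-associative training with gradients possible.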

Paper Structure (8 sections, 2 equations, 5 figures, 3 tables)

This paper contains 8 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: DVP architecture. Perception encodes an image into a latent vector $z$. Program maps $z$ to a Scene. Renderer renders the Scene as a raster image.
  • Figure 2: The reconstructions for 6 test-set examples that DVP fared worst on in terms of MSE, for models trained on 5% (a) and 1% (b) of the training set. Row-wise: the input image; the output of MONet0.9; the output of DVP-P0.9.
  • Figure 3: The impact of noise added to the test set on the metrics. Noise was sampled from a normal distribution with mean $0$ and standard deviation $\sigma^2$.
  • Figure 4: Reconstructions for out-of-sample objects created by replacing shapes in the first 10 testing examples with hourglass, triangle, and L-shape. Row-wise: input scene; the output of DVP-D0.9; the output of DVP-Dsmall.
  • Figure 5: The prototypes learned by DVP-P0.9 trained on 100% (top), 5% (middle), and 1% (bottom) of the training data. Color represents the overall impact, i.e., the normalized sum of weights assigned to each prototype embedding by the Prototype function, estimated on the test set. The order of prototypes is irrelevant.
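The noise experiment summarized in the Figure 3 caption can be approximated with a short sketch. Several details are assumptions: the `reconstruct` callable stands in for a trained model, the error is measured against the clean image, MSE is used as the metric, and `sigma` is passed to `random.gauss` as a standard deviation (the caption labels the noise parameter $\sigma^2$, so the paper's exact convention may differ).

```python
import random
from typing import Callable, Dict, List

Image = List[List[float]]


def add_gaussian_noise(image: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise to every pixel (sigma = standard deviation)."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in image]


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two images of equal size."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n


def noise_sweep(
    images: List[Image],
    reconstruct: Callable[[Image], Image],  # hypothetical model interface
    sigmas: List[float],
    seed: int = 0,
) -> Dict[float, float]:
    """Average reconstruction error against the clean image for each noise level."""
    rng = random.Random(seed)
    results = {}
    for sigma in sigmas:
        errors = [
            mse(reconstruct(add_gaussian_noise(img, sigma, rng)), img)
            for img in images
        ]
        results[sigma] = sum(errors) / len(errors)
    return results


# Usage with an identity "model", which simply passes the noise through.
images = [[[0.5] * 8 for _ in range(8)] for _ in range(2)]
results = noise_sweep(images, lambda img: img, [0.0, 0.1, 0.2])
```

A model that is robust to noise would show a flatter error curve across noise levels than the identity baseline used here.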