Autoencoding Conditional Neural Processes for Representation Learning

Victor Prokhorov, Ivan Titov, N. Siddharth

TL;DR

This work introduces PPS-VAE, a variational framework that learns a partial pixel space (PPS) representation to serve as the context for Conditional Neural Processes (CNPs). By coupling an abstractive latent variable $a$ with a learnable context set, PPS-VAE enables CNPs to fit images more accurately and yields context points that mark meaningful object boundaries, interiors, and backgrounds. The approach shows improved log-likelihoods and informative PPS visualizations, and classifier-based probes indicate that the learned PPS captures class-relevant information both in-distribution and under distribution shift. PPS-VAE also scales to larger images via tile-based encoding and can adjust the context-set size at inference time. Overall, the method makes context selection a principled, learnable component of generative and imputation models for images, with practical implications for downstream tasks and robustness.

Abstract

Conditional neural processes (CNPs) are a flexible and efficient family of models that learn to learn a stochastic process from data. They have seen particular application in contextual image completion: observing pixel values at some locations to predict a distribution over values at other, unobserved locations. However, the choice of pixels in learning CNPs is typically either random or derived from a simple statistical measure (e.g. pixel variance). Here, we turn the problem on its head and ask: which pixels would a CNP like to observe? Do they facilitate fitting better CNPs, and do such pixels tell us something meaningful about the underlying image? To this end we develop the Partial Pixel Space Variational Autoencoder (PPS-VAE), an amortised variational framework that casts CNP context as latent variables learnt simultaneously with the CNP. We evaluate PPS-VAE over a number of tasks across different visual data, and find that not only can it facilitate better-fit CNPs, but also that the spatial arrangement and values of the context points meaningfully characterise image information, as evaluated through classification on both in-distribution and out-of-distribution data. Our model additionally allows for dynamic adaptation of the context-set size and can scale up to larger images, providing a promising avenue for learning meaningful and effective visual representations.
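
The two baseline strategies the abstract contrasts against, random selection and selection by a simple statistic such as pixel variance, can be illustrated with a minimal sketch. This is not the PPS-VAE method, which learns the context set. The function names, the toy image, and the reading of "pixel variance" as variance over a 3x3 neighbourhood are all assumptions made here for illustration.

```python
import random
from statistics import pvariance


def random_context(height, width, m, seed=0):
    """Baseline 1: pick m distinct pixel locations uniformly at random."""
    rng = random.Random(seed)
    coords = [(r, c) for r in range(height) for c in range(width)]
    return rng.sample(coords, m)


def variance_context(image, m, radius=1):
    """Baseline 2 (assumed definition): pick the m locations whose local
    neighbourhood, clipped at the image border, has the highest variance."""
    height, width = len(image), len(image[0])
    scores = []
    for r in range(height):
        for c in range(width):
            patch = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            scores.append((pvariance(patch), (r, c)))
    scores.sort(key=lambda item: item[0], reverse=True)
    return [loc for _, loc in scores[:m]]


def context_set(image, locations):
    """Pair each chosen location with its observed pixel value."""
    return [((r, c), image[r][c]) for r, c in locations]


# Toy 6x6 grayscale image: dark background with a bright 2x2 square.
image = [[0.0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        image[r][c] = 1.0

random_points = context_set(image, random_context(6, 6, m=4))
variance_points = context_set(image, variance_context(image, m=4))
```

On this toy image the variance-based baseline concentrates its four context points on the bright square, while the random baseline ignores image content entirely. Neither is trained jointly with the CNP, which is the gap the abstract says PPS-VAE addresses.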

Paper Structure (77 sections, 6 equations, 36 figures, 11 tables, 1 algorithm)

Figures (36)

  • Figure 1: (top) The PPS-VAE framework. (bottom) Examples of meaningful context points induced by the encoder.
  • Figure 2: CNP generative model (left, yellow); PPS-VAE generative (left) and inference (right) models.
  • Figure 3: Visualisation of the spatial arrangement of the context set for PPS-VAE on test images from three datasets: CLEVR (a,b), CelebA (c,d), and t-ImageNet (e,f). In each panel (a-f), the first row shows the original image with the inferred context set marked by yellow squares, and the second row shows the reconstructed image.
  • Figure 4: Visualisation of the PPS when varying $M$ at inference time. PPS-VAE was pre-trained with $M=128$.
  • Figure 5: Spatial arrangement of the context set for PPS-VAE with tiles. Image size is 256x256, with 8x8 tiles.
  • ...and 31 more figures