Autoencoding Conditional Neural Processes for Representation Learning
Victor Prokhorov, Ivan Titov, N. Siddharth
TL;DR
This work introduces PPS-VAE, a variational framework that learns a partial pixel space (PPS) representation to serve as the context set for Conditional Neural Processes (CNPs). By coupling an abstractive latent variable a with a learnable context set, PPS-VAE enables CNPs to fit images more accurately and yields context points that encode meaningful object boundaries, interiors, and backgrounds. The approach demonstrates improved log-likelihoods, informative PPS visualizations, and classifier-based probes showing that the learned PPS captures class-relevant information both in-distribution and under distribution shift. Moreover, PPS-VAE scales to larger images via tile-based encoding and can adjust the context-set size at inference time. Overall, the method makes context selection a principled, learnable component of generative and imputation models for images, with practical implications for downstream tasks and robustness.
Abstract
Conditional neural processes (CNPs) are a flexible and efficient family of models that learn to learn a stochastic process from data. They have seen particular application in contextual image completion: observing pixel values at some locations to predict a distribution over values at other, unobserved locations. However, the choice of pixels in learning CNPs is typically either random or derived from a simple statistical measure (e.g. pixel variance). Here, we turn the problem on its head and ask: which pixels would a CNP like to observe? Do they facilitate fitting better CNPs, and do such pixels tell us something meaningful about the underlying image? To this end we develop the Partial Pixel Space Variational Autoencoder (PPS-VAE), an amortised variational framework that casts the CNP context as latent variables learnt simultaneously with the CNP. We evaluate PPS-VAE over a number of tasks across different visual data, and find that not only can it facilitate better-fit CNPs, but also that the spatial arrangement and values of the learnt context pixels meaningfully characterise image information, as evaluated through the lens of classification on both in-distribution and out-of-distribution data. Our model additionally allows for dynamic adaptation of context-set size and the ability to scale up to larger images, providing a promising avenue to explore learning meaningful and effective visual representations.
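
To make the contextual image completion setting concrete, the following is a minimal sketch of a CNP forward pass with a randomly chosen context set, i.e. the baseline that the abstract contrasts against. It is not the PPS-VAE architecture: the network sizes, the helper names (init_mlp, pixel_grid, random_context, cnp_predict, gaussian_log_lik), the grayscale 8x8 image, and the untrained random weights are all assumptions made purely for illustration.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a small MLP (untrained; illustration only)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Apply the MLP with ReLU on all but the last layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def pixel_grid(h, w):
    """Normalised (row, col) coordinates for every pixel, shape (h*w, 2)."""
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=-1)

def random_context(rng, n_pixels, m):
    """Baseline context selection: m distinct pixel indices chosen uniformly."""
    return rng.choice(n_pixels, size=m, replace=False)

def cnp_predict(enc, dec, x_ctx, y_ctx, x_tgt):
    """Encode each (location, value) pair, mean-pool, decode at target locations."""
    r_i = mlp(enc, np.concatenate([x_ctx, y_ctx], axis=-1))   # (m, d)
    r = r_i.mean(axis=0)                                      # permutation-invariant summary
    r_rep = np.broadcast_to(r, (x_tgt.shape[0], r.shape[0]))  # (n, d)
    out = mlp(dec, np.concatenate([x_tgt, r_rep], axis=-1))   # (n, 2)
    mean = out[:, :1]
    std = 0.01 + np.log1p(np.exp(out[:, 1:]))                 # softplus keeps std positive
    return mean, std

def gaussian_log_lik(y, mean, std):
    """Sum of independent per-pixel Gaussian log-densities."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * std ** 2)
                        - 0.5 * ((y - mean) / std) ** 2))

rng = np.random.default_rng(0)
h = w = 8
image = rng.random((h, w))                 # stand-in grayscale image
coords = pixel_grid(h, w)                  # all pixel locations
values = image.reshape(-1, 1)              # all pixel values
idx = random_context(rng, h * w, m=10)     # observe 10 random pixels
enc = init_mlp(rng, [3, 32, 16])           # input: 2 coords + 1 value
dec = init_mlp(rng, [2 + 16, 32, 2])       # output: mean and pre-softplus std
mean, std = cnp_predict(enc, dec, coords[idx], values[idx], coords)
ll = gaussian_log_lik(values, mean, std)   # score of the full image under the CNP
```

In this sketch the only place where pixel choice enters is random_context. Under the framing of the abstract, PPS-VAE would replace that uniform choice with context locations and values inferred as latent variables and trained jointly with the CNP; how that inference is parameterised is not specified here and is not reflected in the code above.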
