What We Don't C: Representations for scientific discovery beyond VAEs
Brian Rogers, Micah Bowles, Chris J. Lintott, Steve Croft
TL;DR
The paper tackles the challenge of accessing meaningful features in high-dimensional scientific data that standard variational models do not explicitly capture. It proposes latent flow matching with classifier-free guidance to disentangle conditioning information from residual latent structure: a velocity field links the VAE latent space to a base distribution, and a conditioning dropout mechanism lets the same model run with or without the conditioning signal. Experiments on 2D Gaussians, colored MNIST, and Galaxy10 DECaLS show that the method can selectively remove or preserve conditioning signals, enable style transfer, and isolate class-specific content in real images, revealing information that conventional representations leave hidden. The approach offers a computationally tractable pathway for scientific discovery: researchers can explore what we don't capture and repurpose latent representations without full retraining when the conditioning information changes.
Abstract
Accessing information in learned representations is critical for scientific discovery in high-dimensional domains. We introduce a novel method based on latent flow matching with classifier-free guidance that disentangles latent subspaces by explicitly separating information included in conditioning from information that remains in the residual representation. Across three experiments (a synthetic 2D Gaussian toy problem, colored MNIST, and the Galaxy10 astronomy dataset), we show that our method enables access to meaningful features of high-dimensional data. Our results highlight a simple yet powerful mechanism for analyzing, controlling, and repurposing latent representations, providing a pathway toward using generative models for scientific exploration of what we don't capture, consider, or catalog.
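The mechanism summarized above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the text does not confirm: a straight-line interpolation path between a standard-normal base sample and the latent code, a flow that runs from the base distribution at t = 0 to the latent at t = 1, and a `None` placeholder standing in for dropped conditioning. The names `flow_matching_example`, `guided_velocity`, `p_drop`, and `null_cond` are hypothetical.

```python
import random

def flow_matching_example(z1, cond, rng, p_drop=0.1, null_cond=None):
    """Build one training example for conditional flow matching.

    z1:   list of floats, a latent code (e.g. from a VAE encoder)
    cond: conditioning label or vector attached to that latent
    """
    # Assumption: the base distribution is a standard normal.
    z0 = [rng.gauss(0.0, 1.0) for _ in z1]
    t = rng.random()  # time in [0, 1)
    # Assumption: straight-line path from z0 (t=0) to z1 (t=1).
    zt = [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    # The velocity of a straight-line path is constant: z1 - z0.
    target_v = [b - a for a, b in zip(z0, z1)]
    # Conditioning dropout: sometimes hide the condition so the same
    # model also learns an unconditional velocity field.
    dropped = rng.random() < p_drop
    cond_in = null_cond if dropped else cond
    return zt, t, cond_in, target_v

def guided_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance on velocities.

    w = 0 gives the unconditional field, w = 1 the conditional one,
    and w > 1 extrapolates past the conditional field.
    """
    return [u + w * (c - u) for c, u in zip(v_cond, v_uncond)]
```

In such a setup, a network would be trained to regress `target_v` from `(zt, t, cond_in)`, and `guided_velocity` would combine its conditional and unconditional predictions at sampling time; the network and the ODE integration are omitted here.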
