Spatial Broadcast Decoder: A Simple Architecture for Learning Disentangled Representations in VAEs
Nicholas Watters, Loic Matthey, Christopher P. Burgess, Alexander Lerchner
TL;DR
The paper addresses disentangled representation learning in VAEs by proposing the Spatial Broadcast decoder, which tiles a latent vector across a spatial grid and appends fixed coordinate channels before a shallow unstrided convolutional decoder. This architectural prior lets the model separate positional from non-positional features without supervision, improving disentanglement, reconstruction, and generalization, especially for small objects. Experiments on colored sprites, Chairs, and 3D Object-in-Room show consistent gains in the Mutual Information Gap (MIG) metric and in qualitative disentanglement. The paper also presents a simple latent-space visualization technique that clarifies latent geometry. The method is complementary to state-of-the-art disentangling approaches such as FactorVAE and $\beta$-VAE and can be combined with them to boost their performance with minimal hyperparameter tuning.
Abstract
We present a simple neural rendering architecture that helps variational autoencoders (VAEs) learn disentangled representations. Instead of the deconvolutional network typically used in the decoder of VAEs, we tile (broadcast) the latent vector across space, concatenate fixed X- and Y-"coordinate" channels, and apply a fully convolutional network with 1x1 stride. This provides an architectural prior for dissociating positional from non-positional features in the latent distribution of VAEs, yet without providing any explicit supervision to this effect. We show that this architecture, which we term the Spatial Broadcast decoder, improves disentangling, reconstruction accuracy, and generalization to held-out regions in data space. It provides a particularly dramatic benefit when applied to datasets with small objects. We also emphasize a method for visualizing learned latent spaces that helped us diagnose our models and may prove useful for others aiming to assess data representations. Finally, we show that the Spatial Broadcast decoder is complementary to state-of-the-art (SOTA) disentangling techniques and, when incorporated, improves their performance.
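The broadcast step described in the abstract can be illustrated with a minimal sketch. It uses plain Python lists for clarity, where a real model would use an array library, and covers only the tiling and coordinate concatenation, not the convolutional layers that follow. The function names `linspace` and `spatial_broadcast`, the example latent values, the 4x4 grid, and the coordinate range of [-1, 1] are assumptions made for illustration and are not specified in the text above.

```python
def linspace(start, stop, n):
    """Return n evenly spaced values from start to stop, inclusive."""
    if n == 1:
        return [start]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def spatial_broadcast(z, height, width):
    """Tile latent vector z over a height x width grid and append
    fixed x- and y-coordinate channels to every grid cell.

    Returns a nested list of shape [height][width][len(z) + 2].
    """
    xs = linspace(-1.0, 1.0, width)   # assumed coordinate range
    ys = linspace(-1.0, 1.0, height)  # assumed coordinate range
    return [
        [list(z) + [xs[j], ys[i]] for j in range(width)]
        for i in range(height)
    ]


# Hypothetical 3-dimensional latent sample broadcast to a 4x4 grid.
z = [0.5, -1.2, 0.3]
grid = spatial_broadcast(z, height=4, width=4)
print(len(grid), len(grid[0]), len(grid[0][0]))  # 4 4 5
```

Every cell holds the same copy of the latent vector, so the only position-dependent input to the subsequent stride-1 convolutions is the pair of coordinate channels. This is the sense in which the architecture provides a prior for separating positional from non-positional features.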
