Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images

Jamie S. J. Stirling, Noura Al-Moubayed, Hubert P. H. Shum

Abstract

Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
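To make the matching-quantization idea concrete, the following is a minimal sketch rather than the authors' implementation: it assumes a squared-Euclidean assignment cost and uses `scipy.optimize.linear_sum_assignment` to solve the bipartite matching, so that each of the $L$ encoder embeddings is paired with a distinct codebook entry. The function name `match_quantize` and the toy shapes are illustrative only.

```python
# Minimal sketch of quantization via optimal bipartite matching (hypothetical,
# not the paper's code). Each embedding is assigned a *distinct* codebook
# entry, unlike nearest-neighbour VQ, where several embeddings may collapse
# onto the same code. The squared-Euclidean cost is an assumption.
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_quantize(z, codebook):
    """z: (L, D) embeddings; codebook: (K, D) with K >= L.
    Returns the quantized vectors (L, D) and the chosen code indices (L,)."""
    cost = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (L, K) distances
    _, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return codebook[cols], cols


# Toy usage: 8 embeddings matched into a 32-entry codebook.
rng = np.random.default_rng(0)
zq, idx = match_quantize(rng.normal(size=(8, 16)), rng.normal(size=(32, 16)))
assert len(set(idx.tolist())) == len(idx)  # no code is reused within one image
```

The final assertion is the point of the construction: within one image no code index repeats, which is what raises the effective bottleneck capacity relative to naive nearest-neighbour assignment (Figure 4 quantifies this).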

Paper Structure

This paper contains 18 sections, 4 equations, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Visualisation of (approximately) smooth interpolations on CelebA 64x64 (source images to left and right). The discrete and permutation-invariant nature of the latents allows us to generate multiple, equally plausible paths between two images.
  • Figure 2: The encoder and decoder components of the permutation-invariant autoencoder (quantization step omitted for clarity). Positional embeddings are not applied to the latent codes before decoding; as a result, the decoder is invariant to permutation of the latent codes (see the sketch after this list).
  • Figure 3: Matching quantization: Our novel approach to vector quantization (right) ensures that no two embedding vectors are mapped to the same codebook element in a given image, effectively minimising redundancy in the discrete representation.
  • Figure 4: Information capacity (in bits) as a function of representation length $L$ for three approaches with codebook size $K=4096$: standard VQ-VAE with no permutation invariance; PI-VQ with nearest neighbour quantization ($K_{\mathrm{img}}=49$); PI-VQ with the proposed matching quantization (see the worked example after this list).
  • Figure 5: Logistic regression accuracy with error bars (5-fold cross-validation) for predicting ground-truth FFHQ annotations from learned permutation-invariant representations. For each attribute, we compare against the baseline accuracy of always predicting the majority class (see the probing sketch after this list).
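The permutation invariance claimed in the Figure 2 caption can be demonstrated with a toy computation. The sketch below is an assumption, not the paper's architecture: it supposes a cross-attention-style decoder in which learned positional queries on the output grid attend over the set of latent codes; because no positional embedding is attached to the keys and values, reordering the codes leaves the decoder output unchanged.

```python
# Toy demonstration (hypothetical decoder) that attention over latent codes
# without positional embeddings is invariant to permuting those codes.
import numpy as np

rng = np.random.default_rng(0)
L, D, P = 8, 16, 49                  # codes, code dim, output grid positions
codes = rng.normal(size=(L, D))      # position-free latent codes
queries = rng.normal(size=(P, D))    # positional queries on the output grid


def cross_attend(q, kv):
    # Single-head attention with keys == values == the latent codes.
    logits = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv


out = cross_attend(queries, codes)
assert np.allclose(out, cross_attend(queries, codes[rng.permutation(L)]))
```

Any decoder built from such symmetric aggregations over the code set (attention or pooling) inherits the same invariance, which is why, under this assumption, dropping positional embeddings on the codes is enough to obtain the property stated in the caption.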
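For the Figure 4 comparison, the exact counting conventions (including how $K_{\mathrm{img}}=49$ enters the nearest-neighbour curve) are defined in the paper body and are not reproduced here; the snippet below is only a back-of-the-envelope sketch using standard combinatorial formulas: $L\log_2 K$ bits for an ordered code grid, $\log_2\binom{K}{L}$ for a permutation-invariant set of distinct codes, and a collapsed-vocabulary multiset count, $\log_2\binom{K_{\mathrm{img}}+L-1}{L}$, as a stand-in for naive nearest-neighbour quantization.

```python
# Back-of-the-envelope capacity counts in the spirit of Figure 4 (illustrative
# formulas, not the paper's exact curves).
from math import comb, log2

K = 4096      # codebook size, as in Figure 4
K_img = 49    # effective per-image vocabulary assumed for naive NN quantization

for L in (16, 32, 49):
    ordered = L * log2(K)                 # standard VQ-VAE: ordered code grid
    matching = log2(comb(K, L))           # permutation-invariant, distinct codes
    nn = log2(comb(K_img + L - 1, L))     # multiset over a collapsed vocabulary
    print(f"L={L:2d}  ordered={ordered:6.1f}  matching={matching:6.1f}  nn={nn:6.1f} bits")
```

The qualitative ordering, with matching quantization recovering much of the capacity lost to permutation invariance under nearest-neighbour assignment, is consistent with the abstract's $3.5\times$ claim, but the precise numbers depend on definitions given in the full paper.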
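The probing protocol summarised in the Figure 5 caption can be sketched as follows; this is a hypothetical reconstruction, not the paper's evaluation script, and it assumes the permutation-invariant representation has already been pooled into a fixed-length feature vector per image (e.g., a bag-of-codes histogram).

```python
# Hypothetical attribute-probing setup in the spirit of Figure 5.
# `reps` is an (N, D) array of pooled permutation-invariant representations,
# `labels` an (N,) binary attribute vector; names and pooling are assumptions.
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def probe_attribute(reps, labels):
    """Return (mean accuracy, std) of a 5-fold CV probe and the majority baseline."""
    probe = cross_val_score(LogisticRegression(max_iter=1000), reps, labels, cv=5)
    majority = cross_val_score(DummyClassifier(strategy="most_frequent"),
                               reps, labels, cv=5)
    return probe.mean(), probe.std(), majority.mean()
```

Accuracies well above the majority baseline indicate that the corresponding attribute is linearly decodable from the position-free codes.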