PCA-VAE: Differentiable Subspace Quantization without Codebook Collapse

Hao Lu; Onur C. Koyun; Yongxin Guo; Zhengjie Zhu; Abbas Alili; Metin Nafi Gurcan

TL;DR

This work tackles the non-differentiability and collapse risks of vector quantization in deep generative models by replacing VQ with an online PCA layer learned via Oja’s rule. The PCA-VAE framework swaps the discrete codebook for a differentiable, orthogonal latent projection $\hat{\mathbf{h}} = C C^{\top} (\mathbf{h}-\boldsymbol{\mu}) + \boldsymbol{\mu}$, with the PCA parameters $C$ and $\boldsymbol{\mu}$ updated outside standard backpropagation. Experiments on CelebA-HQ show that PCA-VAE achieves reconstruction quality competitive with or surpassing VQ-based methods while using 10×–100× fewer latent bits, and that it yields interpretable, variance-ordered latent axes (e.g., illumination, pose, gender cues). The results suggest PCA as a principled alternative to vector quantization, offering stability, bit-efficiency, and semantic structure with broad applicability beyond discrete tokenizers.

Abstract

Vector-quantized autoencoders deliver high-fidelity latents but suffer inherent flaws: the quantizer is non-differentiable, requires straight-through hacks, and is prone to collapse. We address these issues at the root by replacing VQ with a simple, principled, and fully differentiable alternative: an online PCA bottleneck trained via Oja's rule. The resulting model, PCA-VAE, learns an orthogonal, variance-ordered latent basis without codebooks, commitment losses, or lookup noise. Despite its simplicity, PCA-VAE exceeds VQ-GAN and SimVQ in reconstruction quality on CelebA-HQ while using 10×–100× fewer latent bits. It also produces naturally interpretable dimensions (e.g., pose, lighting, gender cues) without adversarial regularization or disentanglement objectives. These results suggest that PCA is a viable replacement for VQ: mathematically grounded, stable, bit-efficient, and semantically structured, offering a new direction for generative models beyond vector quantization.
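The PCA bottleneck can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `oja_step` and `project`, the learning rate `lr`, the fade factor `r`, the exponential form of the running mean, and the SVD-based symmetric re-orthonormalization are all choices made for this sketch, and the data is synthetic. In an autodiff framework, `C` and `mu` would additionally be held constant during backpropagation.

```python
import numpy as np

def oja_step(C, mu, H, lr=0.05, r=0.99):
    """One online update from a batch H of shape (n, d).

    C has shape (d, k) with orthonormal columns; mu has shape (d,).
    lr and r are hypothetical hyperparameters for this sketch.
    """
    # Running mean with an exponential fade (assumed form).
    mu = r * mu + (1.0 - r) * H.mean(axis=0)
    Z = H - mu                       # centered features, (n, d)
    Y = Z @ C                        # subspace coefficients, (n, k)
    # Oja-type subspace rule: move C toward directions of high variance.
    C = C + lr * (Z.T @ Y - C @ (Y.T @ Y)) / len(H)
    # Symmetric re-orthonormalization: replace C by its polar factor.
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ Vt, mu

def project(C, mu, H):
    """Row-wise version of h_hat = C C^T (h - mu) + mu."""
    return (H - mu) @ C @ C.T + mu

# Synthetic demo: 8-dimensional features lying near a 2-dimensional subspace.
rng = np.random.default_rng(0)
d, k = 8, 2
A, _ = np.linalg.qr(rng.normal(size=(d, k)))      # ground-truth basis

def sample(n):
    coeffs = rng.normal(size=(n, k)) * [3.0, 2.0]  # variance-ordered axes
    return coeffs @ A.T + 0.5 + 0.01 * rng.normal(size=(n, d))

C, _ = np.linalg.qr(rng.normal(size=(d, k)))      # random orthonormal start
mu = np.zeros(d)
for _ in range(1000):
    C, mu = oja_step(C, mu, sample(64))

H = sample(512)
err = np.mean((project(C, mu, H) - H) ** 2)       # reconstruction error
```

Because the basis is re-orthonormalized after every step, `C.T @ C` stays at the identity, so `C @ C.T` remains an orthogonal projector throughout training.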

Paper Structure (26 sections, 16 equations, 7 figures)

Introduction
Related Work
VQ-VAE Applications
VQ-VAE Variants
Online and Streaming PCA
Methods
Preliminary Knowledge: Principal Component Analysis
Online PCA
Notation and setup
Running-mean centering with geometric $r$-fade
Online subspace learning (Oja-type update)
Symmetric re-orthonormalization
Objective and optimality (link to PCA)
Integration of PCA Layer into VAE
Architecture overview
...and 11 more sections

Figures (7)

Figure 1: Overall architecture of the proposed PCA-VAE. The encoder extracts latent features $\mathbf{h}$ from the input image $\mathbf{x}$. The PCA layer performs an online orthogonal projection $\hat{\mathbf{h}} = C C^{\top} (\mathbf{h} - \boldsymbol{\mu}) + \boldsymbol{\mu}$, where $C$ and $\boldsymbol{\mu}$ are updated via Oja's rule and $r$-fade averaging but treated as stop-gradient variables during VAE backpropagation. The quantized latent $\hat{\mathbf{h}}$ is then decoded to reconstruct $\hat{\mathbf{x}}$. The PCA layer supports both global (single-vector) and spatial (multi-patch) latent configurations, each with its own PCA basis.
Figure 2: Normalized reconstruction performance. We compare PCA-VAE (16$\times$16, 100% bases) with VQGAN RN3, SimVQ RN13, VQ-VAE razavi2019generating, and a VAE RN4 baseline. All VQ models use 16$\times$16 latents and 8,912 codebook tokens. Metrics are normalized so that higher is better (min–max for PSNR/SSIM, reversed for LPIPS/rFID).
Figure 3: Scaling behavior of PCA-VAE with respect to the fraction of principal bases used (1%–100%) under different latent grid resolutions (1$\times$1, 4$\times$4, 8$\times$8, 16$\times$16). The red horizontal line marks the best VQ baseline (SimVQ RN13, 16$\times$16).
Figure 4: Latent bit-budget curves comparing PCA-VAE to VQGAN RN3, SimVQ RN13, VQ-VAE razavi2019generating, and AutoencoderKL RN4. PCA-VAE achieves higher reconstruction quality per bit across PSNR, SSIM, LPIPS, and rFID, often matching VQ performance with 10$\times$–100$\times$ fewer latent bits.
Figure 5: Latent semantics in PCA-VAE. We modify a single latent coefficient within the range $[-2, 2]$ while fixing the others. Each principal direction produces a coherent semantic transition (illumination, head pose, facial structure, shading, hair density), demonstrating interpretable continuous latent axes.
...and 2 more figures
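The metric normalization mentioned in the Figure 2 caption can be illustrated as follows. This is a sketch of one plausible reading, assuming "reverse" means one minus the min–max score so that lower-is-better metrics (LPIPS, rFID) end up on a higher-is-better scale. The helper name `normalize_metric` and the input numbers are hypothetical and are not results from the paper.

```python
def normalize_metric(values, higher_is_better=True):
    """Min-max scale a list of scores to [0, 1]; flip if lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models tie on this metric; give every entry the same score.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    if higher_is_better:
        return scaled
    # Assumed meaning of "reverse": best (lowest) raw value maps to 1.
    return [1.0 - s for s in scaled]

# Illustrative placeholder scores for three models (not from the paper).
psnr_like = normalize_metric([20.0, 25.0, 30.0])
lpips_like = normalize_metric([1.0, 2.0, 3.0], higher_is_better=False)
```

After this step, every metric places its best model at 1 and its worst at 0, which is what allows PSNR, SSIM, LPIPS, and rFID to share one axis.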
