Self-Organising Neural Discrete Representation Learning à la Kohonen

Kazuki Irie; Róbert Csordás; Jürgen Schmidhuber

TL;DR

The paper revisits Kohonen's Self-Organising Maps (KSOM) as the learning rule for Vector Quantisation (VQ) in VQ-VAEs, proposing Kohonen-VAEs (VQ-VAEs with KSOM-trained codebooks) as a robust alternative to EMA-VQ. KSOM introduces grid-based neighbourhood updates. These yield topologically organised discrete representations and often faster convergence early in training, together with strong robustness to the choice of initialisation and update schemes. Across CIFAR-10, ImageNet, and CelebA-HQ/AFHQ, KSOM reaches final reconstruction quality comparable to that of carefully tuned EMA-VQ while offering improved stability and an interpretable codebook topology. Visualisations reveal organised islands in the codebooks and continuity of reconstructions under perturbations of the discrete latent indices. The work offers practical recommendations, demonstrates straightforward integration into existing VQ-VAE frameworks, and highlights the potential of KSOM for robust discrete latent learning in modern generative models.

Abstract

Unsupervised learning of discrete representations in neural networks (NNs) from continuous ones is essential for many modern applications. Vector Quantisation (VQ) has become popular for this, in particular in the context of generative models, such as Variational Auto-Encoders (VAEs), where the exponential moving average-based VQ (EMA-VQ) algorithm is often used. Here, we study an alternative VQ algorithm based on Kohonen's learning rule for the Self-Organising Map (KSOM; 1982). EMA-VQ is a special case of KSOM. KSOM is known to offer two potential benefits: empirically, it converges faster than EMA-VQ, and KSOM-generated discrete representations form a topological structure on the grid whose nodes are the discrete symbols, resulting in an artificial version of the brain's topographic map. We revisit these properties by using KSOM in VQ-VAEs for image processing. In our experiments, the speed-up compared to well-configured EMA-VQ is only observable at the beginning of training, but KSOM is generally much more robust, e.g., w.r.t. the choice of initialisation schemes.
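
The abstract refers to Kohonen's learning rule without stating it. Below is a minimal NumPy sketch of the classic online rule, w_k ← w_k + η h(k, k*) (x − w_k), where k* is the winning (nearest) codeword and h is a neighbourhood function measured on the grid. The function names, the toy 1D grid and the two neighbourhood functions are illustrative assumptions. The paper's Kohonen-VAE appears to use an EMA-based variant (see "Initialisation & Updates of EMAs"), so this per-sample step is not its exact update. With a winner-only neighbourhood, only the nearest codeword moves, which is plain online VQ. A wider neighbourhood also pulls the winner's grid neighbours towards x, and this is what produces the topological ordering.

```python
import numpy as np


def winner_only(sq_grid_dist):
    # h(k, k*) = 1 for the winner only (zero-radius neighbourhood).
    return (sq_grid_dist == 0).astype(float)


def gaussian(sigma):
    # h(k, k*) = exp(-||g_k - g_k*||^2 / (2 sigma^2)), measured on the grid.
    return lambda sq_grid_dist: np.exp(-sq_grid_dist / (2.0 * sigma ** 2))


def ksom_online_step(codebook, grid, x, lr, neighbourhood):
    """One online Kohonen update for a single input vector x.

    codebook: (K, D) codewords w_k; grid: (K, G) grid coordinates of the nodes.
    """
    # Winner: the codeword closest to x in data space (the usual VQ assignment).
    winner = int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))
    # Neighbourhood weights come from distances on the grid, not in data space.
    sq_grid_dist = np.sum((grid - grid[winner]) ** 2, axis=1)
    h = neighbourhood(sq_grid_dist)
    # Kohonen rule: w_k <- w_k + lr * h(k, k*) * (x - w_k)
    return codebook + lr * h[:, None] * (x - codebook), winner


# Usage: 5 nodes on a 1D grid, with 2D codewords laid out along the x-axis.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
grid = np.arange(5.0)[:, None]
x = np.array([2.0, 1.0])
vq_like, winner = ksom_online_step(codebook, grid, x, 0.5, winner_only)
kohonen, _ = ksom_online_step(codebook, grid, x, 0.5, gaussian(1.0))
```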

Paper Structure (22 sections, 13 equations, 8 figures, 5 tables)

Introduction
Background: Kohonen Maps
(Online) Algorithm
Batch Algorithm & Relation to K-means
Topographical Maps in the Brain as Motivation
Alternative VQ in VQ-VAEs
Background: VQ-VAEs
Kohonen-VAEs
Initialisation & Updates of EMAs
Experiments
Sensitivity of the baseline EMA-VQ
Reconstruction Performance and Convergence Speed
Topologically Ordered Discrete Representations
Discussion
Conclusion
...and 7 more sections

Figures (8)

Figure 1: (a) Illustration of hard neighbourhoods (Eq. \ref{eq:hard}) in the 2D case, on a 6×4 grid with $K=24$ nodes. Taking the bottom-left corner node as the origin (0, 0), the eight neighbours of node (2, 2) are highlighted. (b) and (c): Illustration of Gaussian neighbourhoods in the 2D case with a shrinking width (Eq. \ref{eq:gauss}), at two different stages of training.
Figure 2: Evolution of the validation reconstruction loss on CIFAR-10 for the baseline EMA-VQ with different initialisations $N_k^{(0)}$ in Eq. \ref{eq:denominator}.
Figure 3: Visualisation of the codebook ($K=512$) of VQ-VAEs trained on CIFAR-10 with (a) EMA-VQ or (b) KSOM. The EMA-VQ codebook has no apparent structure and serves as a reference. Similar visualisations for VQ-VAE-2 trained on ImageNet can be found in Appendix \ref{app:vis}.
Figure 4: Effects of perturbations to the discrete latent code on reconstructed images. For each image, the top row shows the results for KSOM, and the bottom row shows those for EMA-VQ. "Offset" indicates the offset added to the indices of the latent representations.
Figure 5: Evolution of perplexity (codebook utilisation) as a function of training iterations. The codebook size is $K=512$ in all cases. "Baseline" is the standard EMA-VQ. "Hard" and "Gaussian" indicate the corresponding neighbourhood type for KSOM.
...and 3 more figures
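
The neighbourhoods in Figure 1 can be written down directly from the grid coordinates. This NumPy sketch rebuilds the 6×4 grid with the bottom-left node at (0, 0). It assumes the hard neighbourhood is the set of nodes within Chebyshev distance 1 of the winner, which matches the eight highlighted neighbours of node (2, 2). It also uses a hypothetical exponential decay for the shrinking Gaussian width. The paper's exact hard and Gaussian neighbourhood equations and its shrinking schedule are not reproduced here, and all names are illustrative.

```python
import numpy as np

# Grid of 6 columns x 4 rows (K = 24); node index = row * COLS + col.
COLS, ROWS = 6, 4
grid = np.array([(c, r) for r in range(ROWS) for c in range(COLS)])  # shape (24, 2)


def hard_neighbourhood(grid, winner, radius=1):
    # Assumption: nodes within Chebyshev distance `radius` of the winner get
    # weight 1, all others 0 (the 3x3 block around the winner for radius 1).
    dist = np.max(np.abs(grid - grid[winner]), axis=1)
    return (dist <= radius).astype(float)


def gaussian_neighbourhood(grid, winner, sigma):
    # Gaussian weights on the grid; sigma sets the neighbourhood width.
    sq_dist = np.sum((grid - grid[winner]) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def shrinking_sigma(step, sigma0=2.0, tau=1000.0):
    # Hypothetical exponential decay of the width; the paper's schedule may differ.
    return sigma0 * np.exp(-step / tau)


winner = 2 * COLS + 2  # node (2, 2)
h_hard = hard_neighbourhood(grid, winner)
h_early = gaussian_neighbourhood(grid, winner, shrinking_sigma(0))     # wide, as in (b)
h_late = gaussian_neighbourhood(grid, winner, shrinking_sigma(3000))   # narrow, as in (c)
```

As the width shrinks, the Gaussian neighbourhood approaches the winner-only case, so late in training each update mostly moves the winning codeword.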
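
Figure 2 varies the initial value $N_k^{(0)}$ of the EMA count that appears in the denominator of the codeword update. The sketch below assumes the standard EMA-VQ update: N_k ← γ N_k + (1 − γ) n_k, m_k ← γ m_k + (1 − γ) Σ z_i over the inputs assigned to code k, and e_k = m_k / N_k. It also assumes the initialisation m_k^(0) = N_k^(0) e_k^(0). Under these assumptions, with the codeword starting at the origin, N_k^(0) = 0 moves it all the way to the batch mean in the first step, while larger values damp that move. Function and variable names are hypothetical.

```python
import numpy as np


def ema_vq_step(N, m, z, idx, gamma=0.99, eps=1e-12):
    """One EMA-VQ codebook update (assumed standard form).

    N: (K,) EMA of assignment counts (the denominator); m: (K, D) EMA of summed
    assigned vectors; z: (B, D) encoder outputs; idx: (B,) nearest-code indices.
    """
    K = N.shape[0]
    onehot = np.eye(K)[idx]                          # (B, K)
    N = gamma * N + (1 - gamma) * onehot.sum(axis=0)
    m = gamma * m + (1 - gamma) * (onehot.T @ z)
    # Guard against division by zero for codes with N_k = 0.
    return N, m, m / np.maximum(N, eps)[:, None]


e0 = np.zeros((1, 2))                  # a single codeword (K = 1) at the origin
z = np.ones((4, 2))                    # a batch of 4 encoder outputs at (1, 1)
assignments = np.zeros(4, dtype=int)   # all assigned to code 0
first_update = {}
for N0 in (0.0, 1.0, 10.0):
    N_init = np.full(1, N0)
    m_init = N_init[:, None] * e0      # assumption: m_k^(0) = N_k^(0) * e_k^(0)
    _, _, codebook = ema_vq_step(N_init, m_init, z, assignments)
    first_update[N0] = codebook[0]
```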
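
Figure 4 perturbs the discrete latent code by adding an offset to its indices. The sketch below mimics this on a toy codebook, under several assumptions:

- out-of-range indices are clipped at the codebook boundary;
- a sorted scalar codebook stands in for a KSOM-ordered one;
- the names `perturb_indices` and `lookup` are hypothetical.

When neighbouring indices hold similar codewords, an offset of 1 changes the decoder input only slightly. With the same codewords in scrambled order, the change is much larger.

```python
import numpy as np


def perturb_indices(indices, offset, K):
    # Add a constant offset to every discrete latent index.
    # Assumption: indices are clipped to [0, K - 1]; the paper may treat the
    # boundary differently (e.g. by wrapping around).
    return np.clip(indices + offset, 0, K - 1)


def lookup(codebook, indices):
    # Replace each index by its codeword, i.e. the vector the decoder would receive.
    return codebook[indices]


K = 16
# Toy stand-in for a topologically ordered codebook: scalar codewords sorted by index.
ordered = np.linspace(0.0, 1.0, K)[:, None]
# The same codewords in a scrambled but deterministic order: index i -> 5i mod 16.
scrambled = ordered[(5 * np.arange(K)) % K]
indices = np.arange(K - 1)  # indices that stay in range after an offset of 1
max_shift = {}
for name, cb in (("ordered", ordered), ("scrambled", scrambled)):
    shift = np.abs(lookup(cb, perturb_indices(indices, 1, K)) - lookup(cb, indices))
    max_shift[name] = float(shift.max())
```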
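
Perplexity, used in Figure 5 as a measure of codebook utilisation, is commonly computed as the exponential of the entropy of the empirical code-usage distribution. It equals 1 when a single code is used and K when all K codes are used equally often. Here is a minimal sketch, assuming the paper follows this common definition; the function name is hypothetical.

```python
import numpy as np


def codebook_perplexity(indices, K):
    """Perplexity of code usage: exp(entropy of the empirical code distribution)."""
    p = np.bincount(indices, minlength=K) / len(indices)   # usage frequency of each code
    p = p[p > 0]                                            # 0 * log 0 is taken as 0
    return float(np.exp(-np.sum(p * np.log(p))))


K = 512
uniform = codebook_perplexity(np.arange(K), K)                 # every code used once
collapsed = codebook_perplexity(np.zeros(1000, dtype=int), K)  # only code 0 used
```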

Theorems & Definitions (2)

Remark: Relation to K-means
Remark: Relation to EMA-VQ
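
The two remarks can be illustrated numerically. The sketch makes two assumptions rather than reproducing the paper's exact equations:

- Kohonen's batch-map update, w_k = Σ_i h(k, c_i) z_i / Σ_i h(k, c_i), where c_i is the winner for input z_i;
- a neighbourhood-weighted EMA form for the online VQ-VAE variant.

All function names are hypothetical. With a zero-radius neighbourhood (H equal to the identity), the batch update is one step of Lloyd's K-means and the EMA update reduces to standard EMA-VQ.

```python
import numpy as np


def assign(codebook, z):
    # Index of the nearest codeword for each input row (squared Euclidean distance).
    d = np.sum((z[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    return np.argmin(d, axis=1)


def kmeans_step(codebook, z):
    # One Lloyd (K-means) step: move each codeword to the mean of its assigned inputs.
    c = assign(codebook, z)
    return np.array([z[c == k].mean(axis=0) if np.any(c == k) else codebook[k]
                     for k in range(len(codebook))])


def batch_ksom_step(codebook, z, H):
    # Batch Kohonen update: w_k = sum_i H[k, c_i] z_i / sum_i H[k, c_i],
    # where H[k, j] = h(k, j) is the neighbourhood weight between grid nodes.
    W = H[:, assign(codebook, z)]                  # (K, B)
    denom = W.sum(axis=1, keepdims=True)           # (K, 1)
    # Codewords that receive no weight keep their current value.
    return np.where(denom > 0, W @ z / np.maximum(denom, 1e-12), codebook)


def ema_ksom_step(N, m, codebook, z, H, gamma=0.99, eps=1e-12):
    # Assumed neighbourhood-weighted EMA update; H = identity gives EMA-VQ.
    W = H[:, assign(codebook, z)]                  # (K, B)
    N = gamma * N + (1 - gamma) * W.sum(axis=1)
    m = gamma * m + (1 - gamma) * (W @ z)
    return N, m, m / np.maximum(N, eps)[:, None]


rng = np.random.default_rng(0)
z = rng.normal(size=(100, 2))
codebook = rng.normal(size=(4, 2))
H_delta = np.eye(4)                                # zero-radius neighbourhood
nodes = np.arange(4.0)                             # 1D grid positions
H_gauss = np.exp(-(nodes[:, None] - nodes[None, :]) ** 2 / 2.0)  # Gaussian, sigma = 1

matches_kmeans = np.allclose(batch_ksom_step(codebook, z, H_delta), kmeans_step(codebook, z))
N1, m1, e1 = ema_ksom_step(np.ones(4), codebook.copy(), codebook, z, H_delta)
```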
