Learning symmetries in datasets

Veronica Sanz

TL;DR

This work addresses how symmetries in data shape the latent representations learned by variational autoencoders (VAEs) and proposes a relevance measure to identify the most meaningful latent directions. By training VAEs on both simple symmetric datasets and physics-like data from electron-positron and proton-proton collisions, the author shows that symmetry constraints lead to latent-space compression and alignment with intrinsic degrees of freedom, consistent with momentum conservation and invariant mass constraints. A simple toy model demonstrates that, under idealized conditions, the latent space naturally aligns with symmetry directions, supporting the intuition that unsupervised generative models can reveal underlying data structure. The findings suggest a practical route to symmetry discovery without supervision and point toward symmetry-aware extensions of generative models for data-driven structure discovery in physics and beyond.

Abstract

We investigate how symmetries present in datasets affect the structure of the latent space learned by Variational Autoencoders (VAEs). By training VAEs on data originating from simple mechanical systems and particle collisions, we analyze the organization of the latent space through a relevance measure that identifies the most meaningful latent directions. We show that when symmetries or approximate symmetries are present, the VAE self-organizes its latent space, effectively compressing the data along a reduced number of latent variables. This behavior captures the intrinsic dimensionality determined by the symmetry constraints and reveals hidden relations among the features. Furthermore, we provide a theoretical analysis of a simple toy model, demonstrating how, under idealized conditions, the latent space aligns with the symmetry directions of the data manifold. We illustrate these findings with examples ranging from two-dimensional datasets with $O(2)$ symmetry to realistic datasets from electron-positron and proton-proton collisions. Our results highlight the potential of unsupervised generative models to expose underlying structures in data and offer a novel approach to symmetry discovery without explicit supervision.
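
To make the contrast between an unconstrained two-dimensional dataset and one with $O(2)$ symmetry concrete, the following is a minimal sketch, not the paper's actual data pipeline. The sampling choices (a uniformly filled unit disc for the unconstrained case, a unit circle for the constrained case) and all function names are assumptions introduced here for illustration. It shows that the constraint $x_1^2 + x_2^2 = \text{const}$ removes one degree of freedom, which is the kind of reduction a VAE's latent space would be expected to reflect.

```python
import math
import random


def sample_2d(n, rng):
    """Hypothetical unconstrained dataset: points filling the unit disc."""
    points = []
    for _ in range(n):
        r = math.sqrt(rng.random())  # sqrt gives uniform density over the disc
        phi = rng.uniform(0.0, 2.0 * math.pi)
        points.append((r * math.cos(phi), r * math.sin(phi)))
    return points


def sample_circle(n, rng, radius=1.0):
    """Hypothetical constrained dataset: points with x1^2 + x2^2 = radius^2."""
    points = []
    for _ in range(n):
        phi = rng.uniform(0.0, 2.0 * math.pi)
        points.append((radius * math.cos(phi), radius * math.sin(phi)))
    return points


def radius_spread(points):
    """Standard deviation of sqrt(x1^2 + x2^2); near zero if the constraint holds."""
    radii = [math.hypot(x1, x2) for x1, x2 in points]
    mean = sum(radii) / len(radii)
    return math.sqrt(sum((r - mean) ** 2 for r in radii) / len(radii))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
spread_2d = radius_spread(sample_2d(2000, rng))
spread_1d = radius_spread(sample_circle(2000, rng))
print(f"radius spread, unconstrained: {spread_2d:.3f}")
print(f"radius spread, on the circle: {spread_1d:.2e}")
```

In the constrained sample the radius carries no information, so only the angle varies; the unconstrained sample needs both coordinates. A VAE trained on each would be the next step, which this sketch deliberately leaves out.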

Paper Structure

This paper contains 10 sections, 21 equations, 9 figures.

Figures (9)

  • Figure 1: Variational Autoencoder architecture used in this paper. We will vary the size of the input dimension input-dim, and the latent dimension latent-dim depending on the problem.
  • Figure 2: Relevance distribution of the latent variables for the two datasets. In orange, the truly two-dimensional dataset $\mathcal{D}^{\text{2D}}$ and in blue the dataset constrained to a circle $\mathcal{D}^{\text{1D}}$. The latent variables are ordered by decreasing relevance.
  • Figure 3: Distribution of the mean latent activation $\langle z_1 \rangle$ (most relevant latent direction) in the $(x_1, x_2)$ space. Left: truly two-dimensional dataset $\mathcal{D}^{\text{2D}}$. Right: $O(2)$ symmetric dataset $\mathcal{D}^{\text{1D}}$. In the symmetric case, the latent variable is ordered along the circle.
  • Figure 4: Mean latent activations for all the latent dimensions in the $O(2)$ symmetric dataset, $\langle z_i \rangle$, $i = 1, \ldots, 4$, as a function of $x_1$ (left) and $x_2$ (right).
  • Figure 5: Feynman diagram for the process $e^+ e^- \to \mu^+ \mu^-$ in QED.
  • ...and 4 more figures