Learning Interpretable Low-dimensional Representation via Physical Symmetry

Xuanjie Liu, Daniel Chin, Yichen Huang, Gus Xia

TL;DR

This work introduces SPS, a self-supervised framework that enforces physical symmetry on the latent dynamics of time-series to learn interpretable, low-dimensional representations. By requiring the prior dynamics $R$ to be equivariant to transformations $S$ in latent space, SPS recovers human-aligned factors such as a linear pitch in music and 3D Cartesian coordinates from monocular video, without domain-specific labels. A novel counterfactual representation augmentation mechanism expands training data in the latent space, boosting sample efficiency and aiding interpretability, even under incorrect symmetry assumptions. The approach is demonstrated on two domains (music and video) and extended with SPS+ to enable content–style disentanglement, indicating practical impact for learning compact, human-understandable representations in diverse time-series settings.

Abstract

We have recently seen great progress in learning interpretable music representations, ranging from basic factors, such as pitch and timbre, to high-level concepts, such as chord and texture. However, most methods rely heavily on music domain knowledge. It remains an open question what general computational principles give rise to interpretable representations, especially low-dimensional factors that agree with human perception. In this study, we take inspiration from modern physics and use physical symmetry as a self-consistency constraint for the latent space of time-series data. Specifically, this constraint requires the prior model that characterises the dynamics of the latent states to be equivariant with respect to certain group transformations. We show that physical symmetry leads the model to learn a linear pitch factor from unlabelled monophonic music audio in a self-supervised fashion. In addition, the same methodology can be applied to computer vision, learning a 3D Cartesian space from videos of a simple moving object without labels. Furthermore, physical symmetry naturally leads to counterfactual representation augmentation, a new technique which improves sample efficiency.
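The equivariance constraint described above can be illustrated with a small sketch. It is not the authors' implementation: the toy priors `prior_R` and `prior_bad`, the choice of translation as the transformation $S$, and the helper `symmetry_gap` are all assumptions made for illustration. The sketch compares predicting directly with transforming, predicting, and transforming back; an equivariant prior should give the same result on both paths.

```python
import numpy as np

def prior_R(z_hist):
    """Toy prior (assumption): constant-velocity extrapolation of the next latent."""
    return 2.0 * z_hist[-1] - z_hist[-2]

def prior_bad(z_hist):
    """Toy prior that depends on absolute position, so it is not translation-equivariant."""
    return z_hist[-1] ** 2

def symmetry_gap(prior, z_hist, shift):
    """Mean squared gap between R(z) and S^{-1}(R(S(z))) for a translation S."""
    z_hat = prior(z_hist)                     # predict directly
    # z_hist + shift is a transformed (counterfactual) trajectory in latent space
    z_tilde = prior(z_hist + shift) - shift   # transform, predict, transform back
    return float(np.mean((z_hat - z_tilde) ** 2))

rng = np.random.default_rng(0)
z_hist = np.cumsum(rng.normal(size=(8, 3)), axis=0)  # 8 time steps, 3-D latent
shift = rng.normal(size=3)                           # random translation S

gap_good = symmetry_gap(prior_R, z_hist, shift)   # numerically zero
gap_bad = symmetry_gap(prior_bad, z_hist, shift)  # clearly non-zero
print(gap_good, gap_bad)
```

In a training setting, a gap of this kind could plausibly serve as a penalty term, and the shifted trajectories play the role of extra samples generated in latent space, which is the intuition behind counterfactual representation augmentation.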

Paper Structure

This paper contains 41 sections, 12 equations, 19 figures, 14 tables.

Figures (19)

  • Figure 1: An illustration of physical symmetry as our inductive bias.
  • Figure 2: An overview of our model. $\textbf{x}_{1:T}$ is fed into the encoder $E$ to obtain the corresponding representation $\textbf{z}_{1:T}$, which is then fed into three different branches yielding three outputs respectively: $\textbf{x}'_{1:T}$, $\hat{\textbf{x}}_{2:T+1}$ and $\tilde{\textbf{x}}_{2:T+1}$. Here, $R$ is the prior model and $S$ is the symmetric operation. The inductive bias of physical symmetry requires $R$ to be equivariant with respect to $S$, so $\tilde{\textbf{z}}$ and $\hat{\textbf{z}}$ should be close to each other, as should $\tilde{\textbf{x}}$ and $\hat{\textbf{x}}$.
  • Figure 3: A visualisation of the mapping between the 1D learned factor $\textbf{z}$ and the true pitch, in which a straight line indicates a better result. In the upper row, models encode notes in the test set to $\textbf{z}$. The $x$ axis shows the true pitch and the $y$ axis shows the learned pitch factor. In the lower row, the $x$ axis traverses the $\textbf{z}$ space. The models decode $\textbf{z}$ to audio clips. We apply YIN to the audio clips to detect the pitch, which is shown by the $y$ axis. In both rows, a linear, noiseless mapping is ideal, and our method performs the best.
  • Figure 4: Two example trajectories from the bouncing ball dataset.
  • Figure 5: A visualisation of latent-space traversal performed on three models: (a) ours, (b) ablation, and (c) baseline, in which we see (a) achieves better linearity and interpretability. Here, row $i$ shows the generated images when changing $z_i$ and keeping $z_{\neq i}=0$, where the $x$ axis varies $z_i$ from $-2 \sigma_z$ to $+2 \sigma_z$. We center and normalise $z$, so that the latent space from different runs is aligned for fair comparison. Specifically, in (a), changing $z_2$ controls the ball's height, and changing $z_1, z_3$ moves the ball parallel to the ground plane. In contrast, the behavior in (b) and (c) is less interpretable.
  • ...and 14 more figures
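
The latent-space traversal described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `traversal_grid`, the number of steps, and the example $\sigma_z$ values are hypothetical. Row $i$ of the grid varies $z_i$ from $-2\sigma_z$ to $+2\sigma_z$ while every other dimension stays at 0; each latent vector would then be passed to a decoder (not shown) to produce one image.

```python
import numpy as np

def traversal_grid(sigma_z, n_steps=9, span=2.0):
    """Return an array of shape (dim, n_steps, dim); row i varies only z_i."""
    sigma_z = np.asarray(sigma_z, dtype=float)
    dim = sigma_z.shape[0]
    grid = np.zeros((dim, n_steps, dim))  # all other dimensions stay at 0
    for i in range(dim):
        # sweep dimension i from -span * sigma to +span * sigma
        grid[i, :, i] = np.linspace(-span * sigma_z[i], span * sigma_z[i], n_steps)
    return grid

# Hypothetical per-dimension standard deviations for a 3-D latent space
grid = traversal_grid(sigma_z=[1.0, 0.5, 2.0], n_steps=5)
print(grid.shape)  # (3, 5, 3)
```

If $z$ has already been centered and normalised, as the caption states, the standard deviations would all be close to 1 and the sweep would simply run from $-2$ to $+2$ in each dimension.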