Operational Latent Spaces

Scott H. Hawley, Austin R. Tackett

TL;DR

The paper addresses building operational latent spaces (OpLaS) in pretrained audio representations to enable semantically meaningful operations. It introduces two self-supervised strategies: a mixing framework that learns an invertible projector $h$ so that the sum of embeddings aligns with the embedding of the mix, and FiLMR-based rotations, in which a rotation matrix $M({\vec{u}}, {\vec{v}})$ (constructed via the Aguilera-Perez method) enables ring-like transformations. The Stargate Problem illustrates ring symmetry and tests whether FiLMR can realize forward-step transformations that square matrices struggle to learn. The work presents preliminary, high-level evidence that latent spaces can be shaped to support algebraic operations, enabling latent plugins for pretrained models, with future work extending to real, high-dimensional audio encodings.

Abstract

We investigate the construction of latent spaces through self-supervised learning to support semantically meaningful operations. Analogous to operational amplifiers, these "operational latent spaces" (OpLaS) not only demonstrate semantic structure such as clustering but also support common transformational operations with inherent semantic meaning. Some operational latent spaces are found to have arisen "unintentionally" in the progress toward some (other) self-supervised learning objective, in which unintended but still useful properties are discovered among the relationships of points in the space. Other spaces may be constructed "intentionally" by developers stipulating certain kinds of clustering or transformations intended to produce the desired structure. We focus on the intentional creation of operational latent spaces via self-supervised learning, including the introduction of rotation operators via a novel "FiLMR" layer, which can be used to enable ring-like symmetries found in some musical constructions.
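
To make the idea of a rotation operator with ring-like symmetry concrete, here is a minimal NumPy sketch. It builds a rotation that carries one vector onto another within the plane they span, using a standard Gram-Schmidt construction. This is an illustrative assumption, not necessarily the Aguilera-Perez construction or the FiLMR layer itself. The function name `plane_rotation`, the ring of 8 points, and the 4-D embedding are all hypothetical choices made for the example.

```python
import numpy as np

def plane_rotation(u, v):
    """Rotation matrix (orthogonal, det +1) that maps u/|u| onto v/|v|.

    The rotation acts only in span{u, v}; every direction orthogonal to
    that plane is left unchanged. Assumes u and v are not antiparallel.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    cos_t = float(np.clip(u @ v, -1.0, 1.0))
    w = v - cos_t * u                 # part of v orthogonal to u
    sin_t = np.linalg.norm(w)
    n = u.size
    if sin_t < 1e-12:
        if cos_t > 0:
            return np.eye(n)          # u and v already coincide
        raise ValueError("antiparallel vectors: rotation plane is not unique")
    w = w / sin_t                     # (u, w) is an orthonormal basis of the plane
    return (np.eye(n)
            + (cos_t - 1.0) * (np.outer(u, u) + np.outer(w, w))
            + sin_t * (np.outer(w, u) - np.outer(u, w)))

# Hypothetical ring: 8 equally spaced points on a circle, embedded in 4-D.
num_points, dim = 8, 4
angles = 2.0 * np.pi * np.arange(num_points) / num_points
points = np.zeros((num_points, dim))
points[:, 0] = np.cos(angles)
points[:, 1] = np.sin(angles)

# One rotation built from a single neighboring pair...
R = plane_rotation(points[0], points[1])

# ...advances every point on the ring to its successor.
stepped = points @ R.T
print(np.allclose(stepped, np.roll(points, -1, axis=0)))
```

Because the operator is a rotation by construction, applying it `num_points` times returns every point to where it started, which is the ring property that a freely learned square matrix is not guaranteed to have.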

Paper Structure (6 sections, 1 equation, 6 figures, 1 algorithm)

Figures (6)

  • Figure 1: Encoded stems and mixes from the MUSDB18 audio dataset using the VGGish (top row) and CLAP (bottom row) pretrained encoding models, visualized using PCA (left column) and UMAP (right column). We see that while different stems encode to similar locations, their sums (brown markers) are far from the mix encodings (purple markers), illustrating the nonlinearity of these encoding models.
  • Figure 2: Mixing with embeddings. a) Flowchart of the algorithm, inspired by a similar flowchart from the VICReg paper, shown in b) for comparison. c) Implementation using two classes of 2-D "dots" as proxies for audio stems. The sum of the stems $x_i$ appears in the bottom left in green as the "mix". In the middle column, we apply some nonlinear twisting and leveling to the "dots" in the left column. In the bottom right, the sums of the embeddings (purple shapes) lie right on top of the embeddings of the mixes (green shapes). Finally, the yellow dots in the bottom middle covering the green dots confirm that we have learned an invertible mapping.
  • Figure 3: Mixing in latent space: subtracting the "drums" vector. Here, the signals denoted by "vocal," "bass," "drums" and their time-domain sum "mix" are first embedded in a space $Y$ and then projected into $Z$. We then compare the projected vectors for the mix without the drums (in the time domain) shown in orange with the "audio algebra" result of subtracting the vector for "drums" from the "mix" vector. We see that these are very close to each other in the projected space $Z$.
  • Figure 4: a) Progress of the Stargate Problem using a FiLMR layer. "S-" in the top left of each pane indicates the training step number. b) In contrast, evolution using a learned square orthogonal matrix. While such a solution should exist in theory, the neural network fails to learn the appropriate transformations, perhaps due to dynamic instability. See Figure 5 for a zoomed view of the final simulation states.
  • Figure 5: a): "Final" successful state of model trying the Stargate Problem via a FiLMR layer. The red and pink colors and numbers are intended to show points lining up on top of their "targets," i.e., the next points in the sequence. b): Unsuccessful result of trying to use a learned orthogonal square matrix.
  • ...and 1 more figures
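
The mixing behavior described in Figures 1 through 3 can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: `encode` is a hand-written elementwise cubic standing in for a pretrained encoder such as VGGish or CLAP, and `project` plays the role of the projector $h$. In the paper $h$ is learned by self-supervision. Here it is simply chosen as the exact inverse of the toy encoder, which is possible only because the toy encoder is known and invertible. The stems are random vectors, not audio.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    """Toy nonlinear 'encoder' into a space Y (assumption: elementwise cube)."""
    return x ** 3

def project(y):
    """Toy 'projector' h from Y into Z (assumption: exact inverse of encode)."""
    return np.cbrt(y)

def to_z(x):
    """Full pipeline: signal -> Y -> Z."""
    return project(encode(x))

# Hypothetical stems: random 16-sample vectors standing in for audio.
vocal, bass, drums = rng.normal(size=(3, 16))
mix = vocal + bass + drums

# In Y, the sum of the stem embeddings does not match the embedding of the mix.
gap_y = np.linalg.norm(encode(vocal) + encode(bass) + encode(drums) - encode(mix))

# In Z, the sum of the projected stems lands on the projected mix.
gap_z = np.linalg.norm(to_z(vocal) + to_z(bass) + to_z(drums) - to_z(mix))

# "Audio algebra": subtracting the drums vector from the mix vector in Z
# agrees with embedding the drumless mix directly.
algebra = to_z(mix) - to_z(drums)
target = to_z(vocal + bass)

print(f"gap in Y: {gap_y:.3f}")
print(f"gap in Z: {gap_z:.2e}")
print("mix - drums matches vocal + bass in Z:", np.allclose(algebra, target))
```

With a real pretrained encoder no closed-form inverse is available, which is why the projector has to be trained so that the gap in $Z$ becomes small rather than being zero by construction.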