
Spiking Music: Audio Compression with Event Based Auto-encoders

Martim Lisboa, Guillaume Bellec

TL;DR

This work investigates whether audio compression can benefit from event-based representations inspired by neural spikes. It introduces Spiking Music, an end-to-end binary auto-encoder whose latent is a binary matrix $z \in \{0,1\}^{N,T_z}$, replacing vector quantization with a differentiable binary quantizer and enabling both dense and sparse storage regimes. The paper demonstrates competitive reconstruction at about $3$ kbps in the dense setting and reaches $2.59$ kbps in the sparse regime through a sparsity-driven training schedule, with additional gains from a controllable $\mu$-SPARSE variant that targets bitrate while maintaining quality. A key finding is that, in the sparse regime, the latent units become selective and synchronized with piano note onsets, signaling that the event-based code captures high-level musical structure with potential energy-efficiency benefits for future hardware implementations.

Abstract

Neurons in the brain communicate information via punctual events called spikes. The timing of spikes is thought to carry rich information, but it is not clear how to leverage this in digital systems. We demonstrate that event-based encoding is efficient for audio compression. To build this event-based representation we use a deep binary auto-encoder, and under high sparsity pressure, the model enters a regime where the binary event matrix is stored more efficiently with sparse matrix storage algorithms. We test this on the large MAESTRO dataset of piano recordings against vector quantized auto-encoders. Not only does our "Spiking Music compression" algorithm achieve a competitive compression/reconstruction trade-off, but selectivity and synchrony between encoded events and piano key strikes emerge without supervision in the sparse regime.
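
The storage argument in the abstract can be made concrete with a small calculation. The sketch below compares a dense bitmap against a coordinate list for a binary event matrix, using the $N=80$, $T=1024$ values quoted in the Figure 1 caption. The two formats and the function names are illustrative assumptions, not the four formats or the exact formula used in the paper; the counting bound $\log_2 \binom{NT}{S}$ is included only as a reference point.

```python
import math


def dense_bits(n_units: int, n_steps: int) -> int:
    """Dense bitmap: one bit per matrix entry, independent of the event count."""
    return n_units * n_steps


def coordinate_bits(n_units: int, n_steps: int, n_events: int) -> int:
    """Coordinate list (illustrative format): one (unit, time) index pair per event."""
    # (n - 1).bit_length() is an integer-exact ceil(log2(n)) for n >= 1.
    bits_per_event = (n_units - 1).bit_length() + (n_steps - 1).bit_length()
    return n_events * bits_per_event


def enumerative_bound_bits(n_units: int, n_steps: int, n_events: int) -> int:
    """Bits needed to index one matrix among all matrices with exactly n_events ones."""
    return math.ceil(math.log2(math.comb(n_units * n_steps, n_events)))


if __name__ == "__main__":
    n_units, n_steps = 80, 1024  # values from the Figure 1 caption
    for n_events in (100, 1000, 10000):  # hypothetical event counts
        print(
            n_events,
            dense_bits(n_units, n_steps),
            coordinate_bits(n_units, n_steps, n_events),
            enumerative_bound_bits(n_units, n_steps, n_events),
        )
```

Under these assumptions each event costs 17 bits in the coordinate list (7 for the unit index, 10 for the time index), so that format is smaller than the 81,920-bit dense bitmap only while the matrix holds fewer than about 4,800 events. This is the sense in which sparsity pressure can move the model into a regime where sparse storage wins.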

Paper Structure (19 sections, 9 equations, 5 figures)

This paper contains 19 sections, 9 equations, 5 figures.

Figures (5)

  • Figure 1: Storage of sparse binary matrices. Given a binary event matrix $\boldsymbol{z}$ with $N=80$ units, $T=1024$ time steps, and $S$ events, the storage cost is given by an exact formula. There are four regimes, in each of which one of the four matrix storage formats is optimal.
  • Figure 2: A binary auto-encoder architecture. A: Our encoder is taken from [tagliasacchi2020seanet, defossez2022high] and we report results using the Mousai diffusion decoder [preechakul2022diffusion]. B: Architecture of the SPARSE model. The embeddings and compression prompt $\mu$ are only used in the $\mu$-SPARSE model. C: Details of the "Integrate context" and "Binary quantizer" blocks.
  • Figure 3: Perceptual quality ratings. A: Summary of the MUSHRA ratings of audio reconstruction quality. B: Number of listening tests where the Winner model was rated better than the Loser model. A yellow cell means that the number of wins is not significantly different from the number of losses.
  • Figure 4: Controllable reconstruction quality. A: Bit rate versus SI-SNR for 8 samples across the different compression levels $\mu$. The "quality-controlled" value of $\mu$ is the highest $\mu$ that reconstructs the sample with SI-SNR > 9. The green balls are linked to the spiking representations in panel C. B: Distribution of bit rate versus SI-SNR for the SPARSE model and the $\mu$-SPARSE model with $\mu = 16$. The mean value is indicated by the triangle. Note that the $\mu$-SPARSE model concentrates samples around the target bit rate and limits the drops in SI-SNR. C: Event matrix of a single audio sample at different compression levels.
  • Figure 5: Selectivity and synchrony with piano key strikes. A-B: Event-note correlation functions for the FREE (A) and SPARSE (B) models. C: Distribution of the peak prominence $\phi^{i\alpha}$ for all notes in the range A1-A6, shown for the 15 units with the highest correlation with $\alpha=A2$. The bottom plot shows the distribution of peak prominence for the first unit.
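
The "quality-controlled" selection described in the Figure 4 caption can be sketched as follows. The SI-SNR function uses the commonly cited scale-invariant definition (zero-mean signals, projection of the estimate onto the reference); the page does not state the paper's exact formula, so this is an assumption. The helper `quality_controlled_mu`, its dictionary input, and the reading of the threshold as 9 dB are also hypothetical, introduced only for illustration.

```python
import math


def si_snr_db(reference, estimate, eps=1e-12):
    """Scale-invariant signal-to-noise ratio in dB (common definition, assumed here)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    n = len(reference)
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]

    # Project the estimate onto the reference; the remainder counts as noise.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def quality_controlled_mu(si_snr_by_mu, threshold_db=9.0):
    """Return the highest mu whose SI-SNR exceeds the threshold, or None if none does."""
    passing = [mu for mu, snr in si_snr_by_mu.items() if snr > threshold_db]
    return max(passing) if passing else None


if __name__ == "__main__":
    # Hypothetical per-sample measurements: compression level mu -> SI-SNR in dB.
    measurements = {4: 15.0, 8: 12.0, 16: 9.5, 32: 6.0}
    print(quality_controlled_mu(measurements))
```

Because the projection step rescales the reference to best match the estimate, multiplying the reconstruction by a constant gain leaves the score unchanged, which is what makes the measure scale-invariant.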