Spiking Music: Audio Compression with Event Based Auto-encoders

Martim Lisboa; Guillaume Bellec

TL;DR

This work investigates whether audio compression can benefit from event-based representations inspired by neural spikes. It introduces Spiking Music, an end-to-end binary auto-encoder whose latent is a binary matrix $z \in \{0,1\}^{N,T_z}$, replacing vector quantization with a differentiable binary quantizer and enabling both dense and sparse storage regimes. The paper demonstrates competitive reconstruction at about $3$ kbps in the dense setting and reaches $2.59$ kbps in the sparse regime through a sparsity-driven training schedule, with additional gains from a controllable $\mu$-SPARSE variant that targets bitrate while maintaining quality. A key finding is that, in the sparse regime, the latent units become selective and synchronized with piano note onsets, signaling that the event-based code captures high-level musical structure with potential energy-efficiency benefits for future hardware implementations.

Abstract

Neurons in the brain communicate information via punctual events called spikes. The timing of spikes is thought to carry rich information, but it is not clear how to leverage this in digital systems. We demonstrate that event-based encoding is efficient for audio compression. To build this event-based representation we use a deep binary auto-encoder, and under high sparsity pressure, the model enters a regime where the binary event matrix is stored more efficiently with sparse matrix storage algorithms. We test this on the large MAESTRO dataset of piano recordings against vector quantized auto-encoders. Not only does our "Spiking Music compression" algorithm achieve a competitive compression/reconstruction trade-off, but selectivity and synchrony between encoded events and piano key strikes emerge without supervision in the sparse regime.
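
The dense versus sparse storage trade-off mentioned above can be made concrete with a small calculation. The sketch below is illustrative only: it compares a dense bitmap with a simple coordinate list, using the matrix size quoted in the Figure 1 caption ($N=80$ units, $T=1024$ time steps). The two formats, the function names, and the choice to ignore the cost of storing the event count are assumptions made for this example; they are not necessarily among the four storage formats or the exact formula used in the paper.

```python
import math


def dense_bits(n_units: int, n_steps: int) -> int:
    """Dense bitmap: one bit per (unit, time step) cell, whatever the sparsity."""
    return n_units * n_steps


def coordinate_bits(n_units: int, n_steps: int, n_events: int) -> int:
    """Coordinate list: each event is stored as a (unit index, time index) pair.

    Simplification (assumption): the cost of storing the event count is ignored.
    """
    unit_bits = (n_units - 1).bit_length()   # bits needed to index a unit
    time_bits = (n_steps - 1).bit_length()   # bits needed to index a time step
    return n_events * (unit_bits + time_bits)


def counting_bound_bits(n_units: int, n_steps: int, n_events: int) -> float:
    """log2 of the number of binary matrices with exactly n_events ones."""
    return math.log2(math.comb(n_units * n_steps, n_events))


if __name__ == "__main__":
    N, T = 80, 1024  # sizes quoted in the Figure 1 caption
    for s in (100, 1000, 5000, 20000):  # hypothetical event counts
        print(
            f"S={s:6d}  dense={dense_bits(N, T):6d}  "
            f"coordinates={coordinate_bits(N, T, s):7d}  "
            f"bound={counting_bound_bits(N, T, s):10.1f}"
        )
```

With these sizes each coordinate pair costs 7 + 10 = 17 bits, so under these assumptions the coordinate list is cheaper than the 81920-bit dense bitmap only while the matrix holds fewer than about 4800 events. This is the kind of crossover that separates a sparse regime from a dense one.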

Paper Structure (19 sections, 9 equations, 5 figures)

Introduction
Related work
An event-based neural compression method
Recall on neural compression
Binary sparse matrix storage
Compressed time/units formats
Sparse and dense regimes
Spiking Music compression: training and architecture
Sparse Spiking Music compression
Compression of piano recordings
Dataset and training setup
RVQ baseline
Comparison with similar bit budget
SI-SNR and MUSHRA score
Spiking Music compression quality
...and 4 more sections

Figures (5)

Figure 1: Storage of sparse binary matrices. Given a binary event matrix $\boldsymbol{z}$ with $N=80$ units, $T=1024$ time steps, and $S$ events, the storage cost is given by an exact formula. There are four regimes, and in each one a different one of the four matrix storage formats is optimal.
Figure 2: A binary auto-encoder architecture. (A) Our encoder is taken from [tagliasacchi2020seanet, defossez2022high] and we report results using the Mousai diffusion decoder [preechakul2022diffusion]. (B) Architecture of the SPARSE model. The embeddings and compression prompt $\mu$ are only used in the $\mu$-SPARSE model. (C) Details of the "Integrate context" and "Binary quantizer" blocks.
Figure 3: Perceptual quality ratings. (A) Summary of the MUSHRA ratings of audio reconstruction quality. (B) Number of listening tests where the Winner model was rated better than the Loser model. A yellow cell means that the number of wins is not significantly different from the number of losses.
Figure 4: Controllable reconstruction quality. (A) Bit rate versus SI-SNR for 8 samples across the different compression levels $\mu$. The "quality-controlled" value of $\mu$ is the highest $\mu$ that reconstructs the sample with SI-SNR > 9. The green balls are linked to the spiking representations in panel C. (B) Distribution of bit rate versus SI-SNR for the SPARSE model and the $\mu$-SPARSE model with $\mu = 16$. The mean value is indicated by the triangle. Notice that the $\mu$-SPARSE model concentrates samples around the target bit rate and limits the drops in SI-SNR. (C) Event matrix of a single audio sample at different compression levels.
Figure 5: Selectivity and synchrony with piano key strikes. (A-B) Event-note correlation functions for the FREE (A) and SPARSE (B) models. (C) We display the distribution of the peak prominence $\phi^{i\alpha}$ for all notes in the range A1-A6. We show the 15 units with the highest correlation with $\alpha=A2$. The bottom plot shows the distribution of peak prominence for the first unit.
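
The "quality-controlled" selection described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it uses the standard definition of SI-SNR (scale-invariant signal-to-noise ratio, in dB) on zero-mean signals and then picks the highest $\mu$ whose reconstruction exceeds the threshold of 9 from the caption. The function names, the assumption that the threshold is expressed in dB, and the example SI-SNR values are hypothetical.

```python
import math


def si_snr_db(reference, estimate):
    """Scale-invariant SNR in dB between two equal-length signals."""
    n = len(reference)
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]  # remove DC offset
    est = [e - est_mean for e in estimate]

    # Project the estimate onto the reference to get the scaled target.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    if noise_energy == 0.0:
        return math.inf  # perfect reconstruction up to a scale factor
    return 10.0 * math.log10(target_energy / noise_energy)


def quality_controlled_mu(si_snr_by_mu, threshold_db=9.0):
    """Return the highest mu whose SI-SNR exceeds the threshold, or None."""
    passing = [mu for mu, snr in si_snr_by_mu.items() if snr > threshold_db]
    return max(passing) if passing else None


if __name__ == "__main__":
    # Made-up SI-SNR values for one sample, keyed by compression level mu.
    example = {1: 15.0, 4: 12.0, 16: 9.5, 64: 6.0}
    print(quality_controlled_mu(example))
```

Because SI-SNR projects the estimate onto the reference before measuring the residual, rescaling the reconstruction by a constant gain leaves the score unchanged, which makes it a convenient criterion when a decoder does not preserve absolute loudness.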
