HyperSound: Generating Implicit Neural Representations of Audio Signals with Hypernetworks

Filip Szatkowski; Karol J. Piczak; Przemysław Spurek; Jacek Tabor; Tomasz Trzciński

TL;DR

HyperSound generates implicit neural representations (INRs) for audio that generalize to samples unseen during training. A hypernetwork $H$ maps an input signal $\bm{x}$ to the weights ${\theta}_{\bm{x}} = H(\bm{x})$ of a small target network $T$, which reconstructs the waveform as ${\hat{x}}(t) = T(t, {\theta}_{\bm{x}})$. The model combines a SoundStream-based encoder with NeRF-inspired positional embeddings and an MLP target network, and is trained with a joint time-frequency loss $L = {\lambda}_{SL1} L_{SL1} + {\lambda}_{STFT} L_{STFT}$. Empirically, HyperSound achieves reconstructions competitive with the RAVE baseline on perceptual metrics (e.g., PESQ, STOI) and spectral measures, while RAVE shows larger gains on some objective metrics (e.g., MSE, SI-SNR, CDPAM). The work demonstrates the viability of audio INR generation via hypernetworks and points to improvements in hypernetwork and target-network design, as well as potential applications in compression.
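The pipeline summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the hypernetwork is replaced by a hypothetical fixed random linear map (`toy_hypernetwork`) instead of a SoundStream-based encoder, the layer sizes and number of embedding frequencies are arbitrary, and the STFT term is a single-resolution magnitude difference chosen only for brevity. All function names are illustrative.

```python
import numpy as np

def positional_embedding(t, num_freqs=4):
    """NeRF-style embedding: [t, sin(2^k * pi * t), cos(2^k * pi * t)] for k < num_freqs."""
    t = np.asarray(t, dtype=np.float64).reshape(-1, 1)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = t * freqs                                   # shape (N, num_freqs)
    return np.concatenate([t, np.sin(angles), np.cos(angles)], axis=1)

def target_layer_shapes(in_dim, hidden, depth):
    """(fan_in, fan_out) for each layer of an MLP with `depth` hidden layers and one output."""
    dims = [in_dim] + [hidden] * depth + [1]
    return [(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]

def num_target_params(shapes):
    """Total number of weights and biases the hypernetwork must produce."""
    return sum(i * o + o for i, o in shapes)

def target_network(t, theta, shapes, num_freqs=4):
    """x_hat(t) = T(t, theta): unpack the flat vector theta into layers and run the MLP."""
    h = positional_embedding(t, num_freqs)
    offset = 0
    for idx, (i, o) in enumerate(shapes):
        W = theta[offset:offset + i * o].reshape(i, o)
        offset += i * o
        b = theta[offset:offset + o]
        offset += o
        h = h @ W + b
        if idx < len(shapes) - 1:
            h = np.maximum(h, 0.0)                       # ReLU on hidden layers (assumed)
    return h[:, 0]

def toy_hypernetwork(x, num_params, seed=0):
    """Hypothetical stand-in for H: a fixed random linear map from waveform to theta."""
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.01, size=(num_params, x.shape[0]))
    return A @ x

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-like) loss in the time domain."""
    d = np.abs(pred - target)
    return np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta))

def stft_magnitude(x, n_fft=256, hop=64):
    """Magnitude spectrogram from Hann-windowed frames (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[k * hop:k * hop + n_fft] * window for k in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def joint_loss(pred, target, lam_sl1=1.0, lam_stft=1.0):
    """L = lam_sl1 * L_SL1 + lam_stft * L_STFT, with a simplified STFT term (assumption)."""
    l_stft = np.mean(np.abs(stft_magnitude(pred) - stft_magnitude(target)))
    return lam_sl1 * smooth_l1(pred, target) + lam_stft * l_stft

# Usage: one forward pass for a synthetic "recording".
shapes = target_layer_shapes(in_dim=1 + 2 * 4, hidden=16, depth=2)
x = np.sin(np.linspace(0.0, 20.0, 1024))                 # toy input waveform
theta = toy_hypernetwork(x, num_target_params(shapes))   # theta_x = H(x)
t = np.linspace(-1.0, 1.0, x.shape[0])                   # time coordinates (range assumed)
x_hat = target_network(t, theta, shapes)                 # x_hat(t) = T(t, theta_x)
loss = joint_loss(x_hat, x)
```

The point of the sketch is the division of labor: only the hypernetwork has trainable parameters, while the target network is a fixed architecture whose weights are recomputed for each input signal. In a real training setup the loss would be backpropagated through `theta` into the hypernetwork.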

Abstract

Implicit neural representations (INRs) are a rapidly growing research field that provides alternative ways to represent multimedia signals. Recent applications of INRs include image super-resolution, compression of high-dimensional signals, and 3D rendering. However, these solutions usually focus on visual data, and adapting them to the audio domain is not trivial. Moreover, an INR requires a separately trained model for every data sample. To address this limitation, we propose HyperSound, a meta-learning method leveraging hypernetworks to produce INRs for audio signals unseen at training time. We show that our approach can reconstruct sound waves with quality comparable to other state-of-the-art models.

Paper Structure (8 sections, 4 equations, 2 figures, 1 table)

Introduction
Related Works
Model overview
Hypernetwork architecture
Approximating sound waves with neural networks
Optimization
Experiments
Conclusion

Figures (2)

Figure 1: Overview of the HyperSound framework. We use a single hypernetwork model to produce distinct INRs based on arbitrary audio signals provided as input.
Figure 2: Examples of VCTK validation samples reconstructed with HyperSound.
