As Good as It KAN Get: High-Fidelity Audio Representation
Patryk Marszałek, Maciej Rut, Piotr Kawa, Przemysław Spurek, Piotr Syga
TL;DR
This work studies audio implicit neural representations with Kolmogorov-Arnold Networks (KAN), proposing learnable spline activations to model audio signals efficiently. It introduces FewSound, a hypernetwork-based meta-learning framework that adapts universal INR weights for specific audio tasks, achieving substantial improvements over prior methods like HyperSound in MSE, SI-SNR, and perceptual quality. Across extensive experiments on short and long audio, KAN demonstrates competitive or superior reconstruction fidelity, high-frequency preservation, and strong robustness to encoder choices and datasets. The combination of KAN with FewSound establishes a scalable, adaptable approach for audio representation and compression, with practical potential in codec integration and multilingual settings.
Abstract
Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their application to audio signals remains limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-Spectral Distance (LSD) of 1.29 and the highest Perceptual Evaluation of Speech Quality (PESQ) score of 3.57 on 1.5 s audio. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and a 60.87% improvement in SI-SNR. These results show KAN to be a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at https://github.com/gmum/fewsound.git.
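For readers unfamiliar with the two reconstruction metrics quoted above, the following is a minimal NumPy sketch of MSE and SI-SNR using their standard textbook definitions. It is not taken from the FewSound repository; the function names `mse` and `si_snr`, the `eps` stabilizer, and the synthetic 440 Hz test tone are assumptions made purely for illustration.

```python
import numpy as np


def mse(reference, estimate):
    """Mean squared error between two equal-length waveforms."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    return float(np.mean((reference - estimate) ** 2))


def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (standard definition).

    `eps` is an illustrative stabilizer to avoid division by zero.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    # Remove the DC offset so the metric ignores constant shifts.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Whatever the projection does not explain is treated as noise.
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return float(10.0 * np.log10(ratio))


# Hypothetical example: a 1 s, 16 kHz sine tone with additive Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2.0 * np.pi * 440.0 * t)
noisy = clean + 0.1 * rng.standard_normal(t.shape)

print("MSE:", mse(clean, noisy))
print("SI-SNR (dB):", si_snr(clean, noisy))
```

Because SI-SNR projects the estimate onto the reference before comparing energies, rescaling the reconstruction leaves the score essentially unchanged, whereas MSE is sensitive to gain. The two metrics are therefore complementary, and a lower MSE or a higher SI-SNR indicates a better reconstruction.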
