Musical Metamerism with Time--Frequency Scattering
Vincent Lostanlen, Han Han
TL;DR
The paper approaches musical familiarity through contour-based perception and proposes a method to synthesize musical metamers from any audio recording using joint time--frequency scattering (JTFS). Coarsening the JTFS coefficients with Gaussian averaging over time and log-frequency makes the representation invariant to temporal shifts up to $T$ and to transpositions up to $F$; metamers are then obtained by gradient-based reconstruction from random initializations. The approach requires no manual preprocessing such as transcription or beat tracking, which links questions from cognitive science to signal-processing representations and supports reproducible research. Connections to related spectrotemporal representations (STRF, MPS, and GBFB) are discussed, and the method is implemented in Kymatio with GPU support for practical use in music cognition experiments and beyond.
Abstract
The concept of metamerism originates from colorimetry, where it describes a sensation of visual similarity between two colored lights despite significant differences in spectral content. Likewise, we propose to call ``musical metamerism'' the sensation of auditory similarity elicited by two music fragments whose underlying waveforms differ. In this technical report, we describe a method to generate musical metamers from any audio recording. Our method is based on joint time--frequency scattering (JTFS) as implemented in Kymatio, an open-source Python package which enables GPU computing and automatic differentiation. The advantage of our method is that it does not require any manual preprocessing, such as transcription, beat tracking, or source separation. We provide a mathematical description of JTFS as well as some excerpts from the Kymatio source code. Lastly, we review the prior work on JTFS and draw connections with closely related algorithms, such as spectrotemporal receptive fields (STRF), modulation power spectra (MPS), and Gabor filterbank (GBFB).
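
The TL;DR states that Gaussian averaging over time yields invariance to temporal shifts up to $T$. The toy sketch below illustrates that idea on a one-dimensional onset envelope. It is not JTFS and does not use Kymatio: the envelope, the helper names, and the convention that the Gaussian standard deviation equals $T/2$ are all assumptions made for illustration. It compares the relative error caused by a small shift before and after averaging.

```python
import math

def gaussian_kernel(T):
    """Normalized Gaussian window; sigma = T / 2 is an illustrative assumption."""
    sigma = T / 2.0
    radius = int(4 * sigma)  # truncate the window at four standard deviations
    weights = [math.exp(-0.5 * (n / sigma) ** 2) for n in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def circular_smooth(signal, kernel):
    """Circular convolution of `signal` with an odd-length kernel."""
    n, radius = len(signal), len(kernel) // 2
    return [
        sum(kernel[k + radius] * signal[(i - k) % n] for k in range(-radius, radius + 1))
        for i in range(n)
    ]

def circular_shift(signal, shift):
    """Delay `signal` by `shift` samples, wrapping around at the boundary."""
    n = len(signal)
    return [signal[(i - shift) % n] for i in range(n)]

def relative_distance(a, b):
    """Euclidean distance between a and b, divided by the norm of a."""
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    den = math.sqrt(sum(x ** 2 for x in a))
    return num / den

# Hypothetical toy envelope: a few unit impulses standing in for note onsets.
N = 256
envelope = [0.0] * N
for onset in (20, 70, 75, 130, 200):
    envelope[onset] = 1.0

T = 32           # averaging scale, in samples
small_shift = 4  # temporal shift well below T
kernel = gaussian_kernel(T)
shifted = circular_shift(envelope, small_shift)

# Without averaging, the shifted impulses no longer overlap the originals.
raw_error = relative_distance(envelope, shifted)

# After averaging, the two envelopes are nearly indistinguishable.
smooth_error = relative_distance(
    circular_smooth(envelope, kernel),
    circular_smooth(shifted, kernel),
)

print(f"relative error without averaging: {raw_error:.3f}")
print(f"relative error with averaging:    {smooth_error:.3f}")
```

In this sketch, the averaged envelopes differ far less than the raw ones, which is the sense in which coarsening trades temporal detail for shift invariance. The same reasoning would apply along the log-frequency axis with scale $F$.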
