Learning Perceptually Relevant Temporal Envelope Morphing

Satvik Dixit; Sungjoon Park; Chris Donahue; Laurie M. Heller

TL;DR

The paper addresses the lack of perceptual grounding in temporal envelope morphing by deriving morphing principles from human listening studies and training a supervised model on synthetic data that encodes these principles. It combines an envelope autoencoder with a $64$-dimensional latent space and a twin, order-invariant network that maps two input envelopes and an interpolation weight $\alpha$ to a perceptually valid morphed envelope. On synthetic and naturalistic benchmarks, the approach consistently yields more natural intermediate morphs than traditional mixing, DTW, or latent-space interpolation baselines, and it generalizes well to compositional and real-world signals. The work bridges psychoacoustic findings with deep learning to enable controllable, perceptually plausible sound blending for creative, assistive, and research applications.
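To make the "twin, order-invariant" idea concrete, one plausible reading (an assumption, not stated in the summary) is that swapping the two inputs while replacing $\alpha$ with $1-\alpha$ leaves the output unchanged. The NumPy sketch below is not the authors' network: `toy_morph`, `is_order_invariant`, the envelope length `ENV_LEN`, and the random linear encoder and decoder are all hypothetical stand-ins, and the blend inside `toy_morph` is plain latent interpolation, which the paper treats as a baseline. It only shows how a shared encoder gives this symmetry and how the property could be checked for any morphing function.

```python
import numpy as np

LATENT_DIM = 64   # latent size quoted in the TL;DR
ENV_LEN = 256     # hypothetical envelope length in frames (assumption)

rng = np.random.default_rng(0)
# Hypothetical stand-ins for a trained encoder/decoder: random linear maps.
W_enc = rng.standard_normal((LATENT_DIM, ENV_LEN)) / np.sqrt(ENV_LEN)
W_dec = rng.standard_normal((ENV_LEN, LATENT_DIM)) / np.sqrt(LATENT_DIM)


def toy_morph(env_a, env_b, alpha):
    """Apply one shared ("twin") encoder to both inputs, blend, then decode.

    The blend is plain latent interpolation, used here only as a placeholder
    for a learned morphing network.
    """
    z_a = W_enc @ env_a
    z_b = W_enc @ env_b
    z = (1.0 - alpha) * z_a + alpha * z_b
    return W_dec @ z


def is_order_invariant(morph_fn, env_a, env_b, alpha, tol=1e-9):
    """Check morph(a, b, alpha) == morph(b, a, 1 - alpha) up to `tol`."""
    forward = morph_fn(env_a, env_b, alpha)
    swapped = morph_fn(env_b, env_a, 1.0 - alpha)
    return bool(np.max(np.abs(forward - swapped)) < tol)


if __name__ == "__main__":
    a = rng.random(ENV_LEN)
    b = rng.random(ENV_LEN)
    print(is_order_invariant(toy_morph, a, b, alpha=0.3))
```

The same `is_order_invariant` check could be applied to any callable with the signature `(env_a, env_b, alpha)`, including a trained model wrapped in a function.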

Abstract

Temporal envelope morphing, the process of interpolating between the amplitude dynamics of two audio signals, is an emerging problem in generative audio systems that lacks sufficient perceptual grounding. Morphing of temporal envelopes in a perceptually intuitive manner should enable new methods for sound blending in creative media and for probing perceptual organization in psychoacoustics. However, existing audio morphing techniques often fail to produce intermediate temporal envelopes when input sounds have distinct temporal structures; many morphers effectively overlay both temporal structures, leading to perceptually unnatural results. In this paper, we introduce a novel workflow for learning envelope morphing with perceptual guidance: we first derive perceptually grounded morphing principles through human listening studies, then synthesize large-scale datasets encoding these principles, and finally train machine learning models to create perceptually intermediate morphs. Specifically, we present: (1) perceptual principles that guide envelope morphing, derived from our listening studies, (2) a supervised framework to learn these principles, (3) an autoencoder that learns to compress temporal envelope structures into latent representations, and (4) benchmarks for evaluating audio envelope morphs, using both synthetic and naturalistic data. We show that our approach outperforms existing methods in producing temporally intermediate morphs. All code, models, and checkpoints are available at https://github.com/TemporalMorphing/EnvelopeMorphing.
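The overlay problem described above can be illustrated with a small NumPy sketch. It is a toy demonstration under stated assumptions, not the paper's benchmark: the helpers `pulse_envelope`, `linear_mix`, and `count_onsets`, the frame counts, and the pulse periods are all hypothetical. Two pulse-train envelopes with different rates are mixed linearly at $\alpha = 0.5$; the mix contains the pulses of both inputs rather than a pulse count lying between the two.

```python
import numpy as np


def pulse_envelope(n_frames, period, width=3):
    """Hypothetical test envelope: unit pulses of `width` frames every `period` frames."""
    env = np.zeros(n_frames)
    for start in range(0, n_frames, period):
        env[start:start + width] = 1.0
    return env


def linear_mix(env_a, env_b, alpha):
    """Traditional mixing baseline: a pointwise weighted sum of the two envelopes."""
    return (1.0 - alpha) * env_a + alpha * env_b


def count_onsets(env, eps=1e-12):
    """Count transitions from silence to non-silence (including frame 0)."""
    active = env > eps
    rising = np.sum(active[1:] & ~active[:-1])
    return int(rising + int(active[0]))


if __name__ == "__main__":
    env_a = pulse_envelope(120, period=20)   # 6 pulses
    env_b = pulse_envelope(120, period=30)   # 4 pulses
    mixed = linear_mix(env_a, env_b, alpha=0.5)
    # The mix keeps every distinct pulse position from both inputs.
    print(count_onsets(env_a), count_onsets(env_b), count_onsets(mixed))  # 6 4 8
```

With these settings the inputs have 6 and 4 pulses, while the linear mix has 8, more than either input. A temporally intermediate morph would instead be expected to have a pulse count between 4 and 6.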
