Learning Perceptually Relevant Temporal Envelope Morphing

Satvik Dixit, Sungjoon Park, Chris Donahue, Laurie M. Heller

TL;DR

The paper tackles the lack of perceptually grounded temporal envelope morphing by deriving principles from human listening studies and training a supervised model on synthetic data that encodes these rules. It combines a $64$-dimensional latent autoencoder for envelopes with a twin, order-invariant network to map two inputs and an interpolation weight $\alpha$ to a perceptually valid morphed envelope. Through synthetic and naturalistic benchmarks, the approach consistently yields more natural intermediate morphs than traditional mixing, DTW, or latent-space interpolation baselines, and demonstrates strong generalization to compositional and real-world signals. This work bridges psychoacoustic findings with deep learning to enable controllable and perceptually plausible sound blending for creative, assistive, and research applications.
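
As a rough illustration of what a twin, order-invariant mapper could look like, the sketch below runs both latents through one shared branch and sums the results before a final linear head. The layer sizes, the random weights, the `branch` and `morph_latents` names, and the convention that swapping the inputs corresponds to replacing $\alpha$ with $1-\alpha$ are all assumptions made for illustration; only the $64$-dimensional latent size comes from the summary above, and the authors' actual architecture may differ.

```python
import numpy as np

LATENT_DIM = 64   # latent size stated in the TL;DR
HIDDEN = 128      # hypothetical hidden width, chosen only for this sketch

rng = np.random.default_rng(0)
# One set of branch weights shared by both inputs (the "twin" part).
W_branch = rng.normal(scale=0.1, size=(LATENT_DIM + 1, HIDDEN))
W_head = rng.normal(scale=0.1, size=(HIDDEN, LATENT_DIM))

def branch(z, weight):
    """Embed one latent together with the weight assigned to it."""
    x = np.concatenate([z, [weight]])
    return np.tanh(x @ W_branch)

def morph_latents(z_a, z_b, alpha):
    """Map two latents and an interpolation weight to a morphed latent.

    Summing the two branch outputs makes the result independent of
    argument order, provided alpha is flipped to 1 - alpha on a swap.
    """
    pooled = branch(z_a, 1.0 - alpha) + branch(z_b, alpha)
    return pooled @ W_head

z_a = rng.normal(size=LATENT_DIM)
z_b = rng.normal(size=LATENT_DIM)
out_ab = morph_latents(z_a, z_b, 0.25)
out_ba = morph_latents(z_b, z_a, 0.75)  # swapped inputs, flipped weight
```

In this construction `out_ab` and `out_ba` agree because the pooled sum does not depend on which input is processed first. In a trained system the weights would be learned from the synthetic morph targets rather than drawn at random.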

Abstract

Temporal envelope morphing, the process of interpolating between the amplitude dynamics of two audio signals, is an emerging problem in generative audio systems that lacks sufficient perceptual grounding. Morphing of temporal envelopes in a perceptually intuitive manner should enable new methods for sound blending in creative media and for probing perceptual organization in psychoacoustics. However, existing audio morphing techniques often fail to produce intermediate temporal envelopes when input sounds have distinct temporal structures; many morphers effectively overlay both temporal structures, leading to perceptually unnatural results. In this paper, we introduce a novel workflow for learning envelope morphing with perceptual guidance: we first derive perceptually grounded morphing principles through human listening studies, then synthesize large-scale datasets encoding these principles, and finally train machine learning models to create perceptually intermediate morphs. Specifically, we present: (1) perceptual principles that guide envelope morphing, derived from our listening studies, (2) a supervised framework to learn these principles, (3) an autoencoder that learns to compress temporal envelope structures into latent representations, and (4) benchmarks for evaluating audio envelope morphs, using both synthetic and naturalistic data, and show that our approach outperforms existing methods in producing temporally intermediate morphs. All code, models, and checkpoints are available at https://github.com/TemporalMorphing/EnvelopeMorphing.
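
To make "amplitude dynamics" concrete, the following minimal sketch estimates a temporal envelope by rectifying a signal and smoothing it with a moving average. This is one common textbook approach and is not necessarily the extraction method used by the authors; the `temporal_envelope` name, the 10 ms window, and the synthetic decaying tone are assumptions for illustration.

```python
import numpy as np

def temporal_envelope(signal, sample_rate, window_ms=10.0):
    """Estimate an amplitude envelope: rectify, then moving-average smooth."""
    win = max(1, int(sample_rate * window_ms / 1000.0))
    kernel = np.ones(win) / win
    return np.convolve(np.abs(signal), kernel, mode="same")

# Hypothetical test signal: a 440 Hz tone with an exponentially decaying amplitude.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate      # one second of samples
carrier = np.sin(2 * np.pi * 440 * t)
audio = np.exp(-5.0 * t) * carrier

env = temporal_envelope(audio, sample_rate)   # slowly varying, non-negative curve
```

The resulting `env` tracks the slow decay of the tone while discarding the fast oscillation of the carrier, which is the kind of signal an envelope morphing system would operate on.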

Paper Structure

This paper contains 12 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Method overview. In Stage 1, we repurpose an audio autoencoder to learn a latent representation of temporal envelopes. In Stage 2, we train a twin (order-invariant) mapper network (2b) in a supervised setup on synthetic training data that encodes perceptually grounded morphing rules (2a). During training, the autoencoder is kept frozen while the mapper network learns to morph the envelope embeddings.
  • Figure 2: Creating data for unsupervised training, supervised training, and evaluation in a naturalistic setting. We create large-scale datasets of (1a) real-world audio envelopes for training the autoencoder, (1b) synthetic Gaussian-impulse-based envelopes for training the twin neural network on perception-based morphing rules, and (2) naturalistic audio envelopes for evaluating envelope morphing systems on realistic sounds.
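
The Gaussian-impulse envelopes mentioned in Figure 2 can be pictured with a small sketch: each envelope is a sum of Gaussian bumps at chosen event times. The generator below, its parameters, and the simple peak counter are hypothetical and only illustrate the idea; the paper's actual synthesis procedure and morphing rules are not given in this excerpt. The same sketch also shows the overlay problem described in the abstract, where linear mixing keeps the events of both inputs.

```python
import numpy as np

def gaussian_impulse_envelope(centers, width, num_samples=1001):
    """Sum of Gaussian bumps at the given times (in [0, 1]), peak-normalized."""
    t = np.linspace(0.0, 1.0, num_samples)
    env = np.zeros(num_samples)
    for c in centers:
        env += np.exp(-0.5 * ((t - c) / width) ** 2)
    return env / env.max()

def count_peaks(env, threshold=0.1):
    """Count strict local maxima above a threshold (a crude event counter)."""
    interior = env[1:-1]
    is_peak = (interior > env[:-2]) & (interior > env[2:]) & (interior > threshold)
    return int(is_peak.sum())

# Hypothetical inputs: a two-event envelope and a three-event envelope.
env_a = gaussian_impulse_envelope([0.2, 0.8], width=0.02)
env_b = gaussian_impulse_envelope([0.35, 0.5, 0.65], width=0.02)

# Linear mixing baseline at the midpoint: both temporal structures survive.
mix = 0.5 * env_a + 0.5 * env_b
num_events = count_peaks(mix)
```

Here the midpoint mix carries all five events from the two inputs rather than a single intermediate pattern, which is the overlay behavior the learned morpher is meant to avoid.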