Pitch Contour Exploration Across Audio Domains: A Vision-Based Transfer Learning Approach

Jakob Abeßer, Simon Schwär, Meinard Müller

TL;DR

This work tackles cross-domain pitch contour analysis by replacing explicit pitch tracking with a vision-based approach that processes time-frequency representations with a CNN backbone. It introduces the Synthetic Pitch Contour (SPC) dataset, which covers seven pitch contour (PC) types, and uses two-stage transfer learning: the backbone is first pretrained on ImageNet to learn natural-image features, then pretrained on SPC to learn PC features, and finally fine-tuned on diverse downstream tasks. Across eight downstream datasets spanning music, speech, bioacoustics, and everyday sounds, the vision-based VI-2D method consistently matches or exceeds the pitch-tracking–based PT-1D method, with pretraining on ImageNet and SPC providing notable gains. The approach establishes a foundation for cross-domain comparisons of pitch-contour properties and points to future work on longer contours and cross-species analyses.

Abstract

This study examines pitch contours as a unifying semantic construct prevalent across various audio domains including music, speech, bioacoustics, and everyday sounds. Analyzing pitch contours offers insights into the universal role of pitch in the perceptual processing of audio signals and contributes to a deeper understanding of auditory mechanisms in both humans and animals. Conventional pitch-tracking methods, while optimized for music and speech, face challenges in handling the much broader frequency ranges and more rapid pitch variations found in other audio domains. This study introduces a vision-based approach to pitch contour analysis that eliminates the need for explicit pitch tracking. The approach uses a convolutional neural network, pre-trained for object detection in natural images and fine-tuned with a dataset of synthetically generated pitch contours, to extract key contour parameters from the time-frequency representation of short audio segments. A diverse set of eight downstream tasks from four audio domains was selected to provide a challenging evaluation scenario for cross-domain pitch contour analysis. The results show that the proposed method consistently surpasses traditional techniques based on pitch tracking on a wide range of tasks. This suggests that the vision-based approach establishes a foundation for comparative studies of pitch contour characteristics across diverse audio domains.

Paper Structure

This paper contains 26 sections, 17 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Excerpts of pitch contours selected from four different audio domains covering speech (S), music (M), bioacoustics (B), and everyday sounds (E). Each pitch contour is displayed as a one-second Constant-Q spectrogram with log-scaled magnitude.
  • Figure 2: Two approaches for pitch contour (PC) analysis. (1) End-to-end approach (PT-1D), in which PCs are extracted with a pitch-tracking algorithm and processed by a deep neural network (DNN) model with a trainable convolutional front-end. (2) Proposed vision-based approach (VI-2D), in which PCs are captured as time-frequency (TF) representations of audio clips and processed by a DNN model with a MobileNetV2 front-end that has been pre-trained on ImageNet. In both approaches, pre-trained models are later fine-tuned for different downstream classification tasks.
  • Figure 3: Three example contours for each of the seven PC types.
  • Figure 4: Common fundamental frequency ranges across everyday sounds, animal vocalizations, and musical instruments. The frequency range between 25 Hz and 10 kHz, which is used to sample the base frequency values $f_\mathrm{b}$ in the SPC dataset, is marked as a gray rectangle. ($^*$infrasonic call, $^{**}$low-frequency calls)
  • Figure 5: Results for the pYIN, SWIPE, and CREPE pitch-tracking algorithms on stable, glissando, and vibrato PCs in the SPC dataset. Raw pitch accuracy (RPA50) is plotted against the contour base frequency $f_\mathrm{b}$.
  • ...and 3 more figures
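
The caption of Figure 5 reports raw pitch accuracy (RPA50) without defining it. The following is a minimal sketch of how such a metric is commonly computed, assuming that "50" denotes a tolerance of 50 cents and that frames with a non-positive reference frequency are unvoiced and skipped. The function names, the vibrato rate and extent, and the frame rate are hypothetical illustration values, not parameters taken from the paper or the SPC dataset.

```python
import math


def cents_between(f_est, f_ref):
    """Signed distance from f_ref to f_est in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_est / f_ref)


def raw_pitch_accuracy(est_hz, ref_hz, tolerance_cents=50.0):
    """Fraction of voiced reference frames whose estimate is within the tolerance.

    Assumptions: frames with ref <= 0 are unvoiced and ignored;
    estimates <= 0 on voiced frames count as misses.
    """
    if len(est_hz) != len(ref_hz):
        raise ValueError("estimate and reference must have the same length")
    voiced = 0
    correct = 0
    for f_est, f_ref in zip(est_hz, ref_hz):
        if f_ref <= 0:
            continue  # skip unvoiced reference frames
        voiced += 1
        if f_est > 0 and abs(cents_between(f_est, f_ref)) <= tolerance_cents:
            correct += 1
    return correct / voiced if voiced else 0.0


def vibrato_contour(f_b, n_frames=100, frame_rate=100.0,
                    rate_hz=6.0, extent_cents=100.0):
    """Hypothetical vibrato contour: sinusoidal deviation (in cents) around f_b."""
    contour = []
    for n in range(n_frames):
        deviation = extent_cents * math.sin(2.0 * math.pi * rate_hz * n / frame_rate)
        contour.append(f_b * 2.0 ** (deviation / 1200.0))
    return contour


if __name__ == "__main__":
    reference = vibrato_contour(440.0)
    flat_estimate = [440.0] * len(reference)        # tracker that misses the vibrato
    octave_estimate = [2.0 * f for f in reference]  # tracker with octave errors
    print("flat estimate RPA50:  ", raw_pitch_accuracy(flat_estimate, reference))
    print("octave estimate RPA50:", raw_pitch_accuracy(octave_estimate, reference))
```

In this sketch, an estimate that is an octave too high scores zero because every frame is 1200 cents away from the reference, while an estimate that ignores the vibrato is only correct on frames where the modulation stays within 50 cents of the base frequency.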