ConvNets for Counting: Object Detection of Transient Phenomena in Steelpan Drums

Scott H. Hawley, Andrew C. Morrison

TL;DR

The paper presents SPNet, a CNN-based detector trained to count interference fringes within elliptical antinode regions in ESPI videos of transient steelpan vibrations. By combining crowdsourced SVP annotations with synthetic, style-transferred data, the authors demonstrate high accuracy on synthetic datasets and provide initial, physically meaningful insights from optical measurements, including octave-frequency alignment and notable delays relative to acoustic signals. The work highlights the challenges of real-world annotation variability and proposes a path forward through improved labeling, transfer learning, and physics-informed data generation. This approach offers a scalable framework for extracting time-dependent vibrational dynamics from ESPI imagery, with potential applicability to other musical-instrument diagnostics and transient ESPI analyses.

Abstract

We train an object detector built from convolutional neural networks to count interference fringes in elliptical antinode regions in frames of high-speed video recordings of transient oscillations in Caribbean steelpan drums illuminated by electronic speckle pattern interferometry (ESPI). The annotations provided by our model aim to contribute to the understanding of time-dependent behavior in such drums by tracking the development of sympathetic vibration modes. The system is trained on a dataset of crowdsourced human-annotated images obtained from the Zooniverse Steelpan Vibrations Project. Due to the small number of human-annotated images and the ambiguity of the annotation task, we also evaluate the model on a large corpus of synthetic images whose properties have been matched to the real images by style transfer using a Generative Adversarial Network. Applying the model to thousands of unlabeled video frames, we measure oscillations consistent with audio recordings of these drum strikes. One unanticipated result is that sympathetic oscillations of higher-octave notes significantly precede the rise in sound intensity of the corresponding second harmonic tones; the mechanism responsible for this remains unidentified. This paper primarily concerns the development of the predictive model; further exploration of the steelpan images and deeper physical insights await its further application.

Paper Structure

This paper contains 16 sections, 1 equation, 9 figures, 3 tables.

Figures (9)

  • Figure 1: (color online) Illustration of the Steelpan Vibrations Project (SVP) task: Ellipses "drawn" (in green) by human annotators around antinodes in an ESPI steelpan video frame via the Zooniverse crowd-sourcing data annotation interface. Not shown: Annotations also include users' counts of the number of interference fringes or rings for each antinode region.
  • Figure 2: (color online) Graphical representation of one aspect of the variability in the aggregated human annotations comprising the SVP dataset. While, physically, antinodes typically persist over 50 to hundreds of frames, the fine structure of the raw data in this graph shows that some antinodes were not annotated consistently from frame to frame (even in the aggregated data). This is the dataset used to train and score the SPNet model. The graph shows only whether an antinode is marked; it does not display the further variability in ring counts.
  • Figure 3: (color online) Diagram of the SPNet architecture. The grayscale input image is resized via average pooling and two additional ("color") channels are added via 3x3 convolutions before feeding into a "stock" base model chosen from available Keras models (as described in the text, we prefer Xception), which is then fully connected to a flattened layer which holds the values of a 6x6x2 grid of predictors for the 8 variables in Table \ref{table:vars}. ($6\times 6\times 2\times 8 = 576$ values in the model output.) The operations to the left of the base model can be regarded as a "residual block" designed to shrink the image to lower memory costs while still retaining some finer details of the larger input image. Also shown as an array of red dots on the input image are the centroids of regions covered by the predictors, which predict antinode centroid coordinates in terms of offsets from these locations. Not shown: Leaky ReLU activations and batch normalization between layers. (Note: the images shown for intermediate layers are "artwork," not actual layer activations.)
  • Figure 4: (color online) Sample fake images, showing ground-truth bounding ellipses and ring counts (upper values, light-yellow) and those predicted by the network (lower values, dark-purple). Top: original style of fake image, from FakeLarge dataset. Bottom: same fake image with "real" style transferred via CycleGAN, from CGLarge dataset.
  • Figure 5: (color online) Training progress. Top: various components of the loss function for dataset FakeLarge. (A similar graph for Real would show Validation loss values leveling off after approximately 20 epochs, which is where the Training loss crosses the Validation loss.) Bottom: Classification-like accuracy scores for ring counts for validation subsets of all datasets. Despite FakeSmall, CGSmall, and Real all having similar numbers of training images (ca. 1200, which are then augmented as per Section \ref{subsubsec:data_aug}), FakeLarge and CGSmall have much higher accuracy scores than Real. The fact that the accuracy for Real does not improve beyond Epoch 20 indicates the variability of the human-supplied data annotations.
  • ...and 4 more figures
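
The output layout described in the Figure 3 caption (a 6x6x2 grid of predictors, each holding 8 variables, flattened to 576 values) can be illustrated with a short NumPy sketch. This is not the authors' code. The row-major flattening order, the positions of the x and y offsets within the 8 variables (`x_index`, `y_index`), the use of pixel units for offsets, the placement of reference points at grid-cell midpoints, and the 512x384 image size in the usage note are all assumptions made for illustration, since the variables table is not part of this excerpt.

```python
import numpy as np

# Grid dimensions taken from the Figure 3 caption: 6 x 6 cells,
# 2 predictors per cell, 8 variables per predictor (576 values total).
GRID_H, GRID_W, PREDS_PER_CELL, N_VARS = 6, 6, 2, 8


def reshape_output(flat):
    """Reshape a flat 576-value output vector into (6, 6, 2, 8).

    Assumption: the flattened layer is stored in row-major (C) order.
    """
    flat = np.asarray(flat, dtype=float)
    expected = GRID_H * GRID_W * PREDS_PER_CELL * N_VARS
    if flat.size != expected:
        raise ValueError(f"expected {expected} values, got {flat.size}")
    return flat.reshape(GRID_H, GRID_W, PREDS_PER_CELL, N_VARS)


def cell_centers(image_w, image_h):
    """Pixel coordinates of the midpoint of each grid cell.

    Assumption: the reference points (the red dots in the figure)
    sit at the cell midpoints of a uniform grid.
    Returns two arrays of shape (GRID_H, GRID_W): x and y.
    """
    xs = (np.arange(GRID_W) + 0.5) * image_w / GRID_W
    ys = (np.arange(GRID_H) + 0.5) * image_h / GRID_H
    return np.meshgrid(xs, ys)


def decode_centroids(grid, image_w, image_h, x_index=0, y_index=1):
    """Convert predicted offsets into absolute centroid coordinates.

    x_index and y_index are hypothetical positions of the offset
    variables; offsets are assumed to be expressed in pixels.
    Returns x and y arrays of shape (GRID_H, GRID_W, PREDS_PER_CELL).
    """
    cx, cy = cell_centers(image_w, image_h)
    x = cx[..., None] + grid[..., x_index]
    y = cy[..., None] + grid[..., y_index]
    return x, y
```

For example, calling `decode_centroids(reshape_output(np.zeros(576)), 512, 384)` returns the cell midpoints themselves, because every predicted offset is zero. A real decoder would also need the remaining variables (such as ellipse size, orientation, and ring count) in whatever order the paper's table defines.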