Variable-Length Audio Fingerprinting

Hongjie Chen, Hanyu Meng, Huimin Zeng, Ryan A. Rossi, Lie Lu, Josh Kimball

Abstract

Audio fingerprinting converts audio to much lower-dimensional representations, allowing distorted recordings to still be recognized as their originals through similar fingerprints. Existing deep learning approaches rigidly fingerprint fixed-length audio segments, thereby neglecting temporal dynamics during segmentation. To address limitations due to this rigidity, we propose Variable-Length Audio FingerPrinting (VLAFP), a novel method that supports variable-length fingerprinting. To the best of our knowledge, VLAFP is the first deep audio fingerprinting model capable of processing audio of variable length, for both training and testing. Our experiments show that VLAFP outperforms existing state-of-the-art methods in live audio identification and audio retrieval across three real-world datasets.

Paper Structure (31 sections, 6 equations, 16 figures, 18 tables, 5 algorithms)

Figures (16)

  • Figure 1: Limitations of fixed-length audio fingerprinting. The top row shows a $3$-second excerpt (containing both interlude and verse) in its original form and with a $1.5\times$ speed-up. The middle and bottom rows compare fixed-length and variable-length segmentation. Fixed-length segmentation suffers from three issues. (1) Loss of natural boundaries: segments cut across semantic units ((a)-2), complicating interpretation. (2) Redundant or noisy context: segments oversimplify ((a)-1) or combine too much information ((a)-2). (3) Distortion incompatibility: time-stretching prevents exact matching, so no subfigure in (b) aligns perfectly with its counterpart in (a). Variable-length segmentation, as used by our proposed VLAFP, overcomes these issues by producing segments aligned with semantic boundaries.
  • Figure 2: The architecture of our proposed VLAFP model. (a) Initial Projection: Audio $\boldsymbol{\mathrm{A}}$ in its spectrogram representation is projected through a linear layer. (b) Inter-frame Self-Attention: Multi-head self-attention layers learn inter-frame relationships. (c) Frame-to-segment Cross-Attention: Multi-head cross-attention layers model the frame-to-segment relationships. (d) Segment Embedding Initialization: Replicas of segment embeddings are initialized through frame-to-segment pooling. (e) Fingerprint Summarization: Replicas of segment embeddings are aggregated and L2-normalized to generate an audio fingerprint $\mathbf{z}$.
  • Figure 3: Segment count and average length.
  • Figure 4: Segment length distribution.
  • Figure 5: DTR results of VLAFP on FMA with different $\theta$.
  • ...and 11 more figures
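
The data flow described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It makes several simplifying assumptions: attention is single-head rather than multi-head, the projection weights are random and untrained, segment boundaries are supplied by the caller, segment embeddings act as queries in the cross-attention, and the "replicas" of segment embeddings are omitted. The names `attention`, `fingerprint`, `w_proj`, and `boundaries` are hypothetical. The sketch only shows how inputs with different numbers of frames and segments can map to fingerprints of the same dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention (assumption: the paper uses multi-head).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def fingerprint(spec, boundaries, w_proj):
    """spec: (T, F) spectrogram frames; boundaries: sorted segment start
    indices beginning with 0 (hypothetical input); w_proj: (F, D) weights."""
    x = spec @ w_proj                      # (a) initial linear projection
    x = x + attention(x, x, x)             # (b) inter-frame self-attention, with residual
    # (d) initialise one embedding per segment by mean-pooling its frames
    chunks = np.split(x, boundaries[1:])
    segs = np.stack([c.mean(axis=0) for c in chunks])
    segs = segs + attention(segs, x, x)    # (c) segments attend to all frames
    z = segs.mean(axis=0)                  # (e) aggregate segment embeddings
    return z / np.linalg.norm(z)           # L2-normalise the fingerprint

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 16)) * 0.1    # untrained weights, for illustration only
short = fingerprint(rng.standard_normal((120, 64)), [0, 40, 100], w)
long = fingerprint(rng.standard_normal((300, 64)), [0, 90, 150, 260], w)
print(short.shape, long.shape)
```

Both calls return a unit-norm vector of length 16 even though the inputs differ in frame count (120 versus 300) and segment count (3 versus 4), which is the property that lets fingerprints of variable-length audio be compared by similarity.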