Variable-Length Audio Fingerprinting

Hongjie Chen; Hanyu Meng; Huimin Zeng; Ryan A. Rossi; Lie Lu; Josh Kimball

Abstract

Audio fingerprinting converts audio to much lower-dimensional representations, allowing distorted recordings to still be recognized as their originals through similar fingerprints. Existing deep learning approaches rigidly fingerprint fixed-length audio segments, thereby neglecting temporal dynamics during segmentation. To address the limitations caused by this rigidity, we propose Variable-Length Audio FingerPrinting (VLAFP), a novel method that supports variable-length fingerprinting. To the best of our knowledge, VLAFP is the first deep audio fingerprinting model capable of processing audio of variable length during both training and testing. Our experiments show that VLAFP outperforms existing state-of-the-art methods in live audio identification and audio retrieval across three real-world datasets.

Paper Structure (31 sections, 6 equations, 16 figures, 18 tables, 5 algorithms)

Introduction
Related Work
Variable-Length Audio Fingerprinting
Variable-length Dual-attention Transformer
Objective
Experimental Setup
Tasks
Results
Commercial-Broadcast Retrieval
Dummy-Target Retrieval
Model Size and Runtime
Ablation Study
Impacts of Hyperparameters
Conclusion
Details of Preliminaries
...and 16 more sections

Figures (16)

Figure 1: Limitations of fixed-length audio fingerprinting. The top row shows a $3$-second excerpt (containing both an interlude and a verse) in its original form and with a $1.5\times$ speed-up. The middle and bottom rows compare fixed-length and variable-length segmentation. Fixed-length segmentation suffers from three issues: (1) Loss of natural boundaries: segments cut across semantic units ((a)-2), complicating interpretation. (2) Redundant or noisy context: segments oversimplify ((a)-1) or combine too much information ((a)-2). (3) Distortion incompatibility: time-stretching prevents exact matching, since no subfigure in (b) aligns perfectly with its counterpart in (a). Variable-length segmentation, which our proposed VLAFP supports, overcomes these issues by producing segments aligned with semantic boundaries.
Figure 2: The architecture of our proposed VLAFP model. (a) Initial Projection: Audio $\boldsymbol{\mathrm{A}}$ in its spectrogram representation is projected through a linear layer. (b) Inter-frame Self-Attention: Multi-head self-attention layers learn inter-frame relationships. (c) Frame-to-segment Cross-Attention: Multi-head cross-attention layers model the frame-to-segment relationships. (d) Segment Embedding Initialization: Replicas of segment embeddings are initialized through frame-to-segment pooling. (e) Fingerprint Summarization: Replicas of segment embeddings are aggregated and L2-normalized to generate an audio fingerprint $\mathbf{z}$.
Figure 3: Segment count and average length.
Figure 4: Segment length distribution.
Figure 5: DTR results of VLAFP on FMA with different $\theta$.
...and 11 more figures
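
As a rough illustration of the last two stages in the Figure 2 caption (frame-to-segment pooling, then aggregation and L2 normalization), the sketch below turns segments of different lengths into fingerprints of the same dimension. It is not the VLAFP model: the attention layers are omitted, plain mean pooling stands in for the learned pooling, and all function names (`mean_pool`, `l2_normalize`, `fingerprint`, `similarity`) are hypothetical.

```python
import math


def mean_pool(frames):
    """Average a variable-length list of frame vectors into one vector.

    Assumption: simple mean pooling replaces the learned
    frame-to-segment pooling described in the Figure 2 caption.
    """
    if not frames:
        raise ValueError("a segment must contain at least one frame")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def fingerprint(frames):
    """Hypothetical helper: pool a segment's frames, then L2-normalize."""
    return l2_normalize(mean_pool(frames))


def similarity(z1, z2):
    """Dot product; equals cosine similarity for unit-norm fingerprints."""
    return sum(a * b for a, b in zip(z1, z2))


# Two toy segments with different numbers of frames (2 vs. 3),
# each frame being a 2-dimensional feature vector.
short_segment = [[1.0, 0.0], [0.0, 1.0]]
long_segment = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

z_short = fingerprint(short_segment)
z_long = fingerprint(long_segment)
score = similarity(z_short, z_long)
```

Both fingerprints have the same dimension regardless of how many frames each segment contains, so they can be compared directly with a dot product.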
