EnvId: A Metric Learning Approach for Forensic Few-Shot Identification of Unseen Environments

Denise Moussa, Germans Hirsch, Christian Riess

TL;DR

This paper tackles the forensic identification of where an audio recording was made by reframing the problem as a few-shot metric-learning task that avoids case-specific retraining. It introduces EnvId, an end-to-end framework built on Prototypical Networks that performs open-set, $N$-way $K$-shot environment identification and, optionally, blind regression of environmental parameters such as RT$_{60}$ and volume. A flexible data-generation pipeline simulates realistic reverberant, noisy, and compressed conditions to mirror forensic scenarios, enabling evaluation across unseen degradations and out-of-distribution locations. Results show high accuracy on diverse test pools, strong open-set rejection, and notable robustness to unseen degradations. EnvId can also estimate environmental parameters, providing practical groundwork for forensic audio analysis in the wild.

Abstract

Audio recordings may provide important evidence in criminal investigations. One such case is the forensic association of a recorded audio to its recording location. For example, a voice message may be the only investigative cue to narrow down the candidate sites for a crime. Up to now, several works provide supervised classification tools for closed-set recording environment identification under relatively clean recording conditions. However, in forensic investigations, the candidate locations are case-specific. Thus, supervised learning techniques are not applicable without retraining a classifier on a sufficient amount of training samples for each case and respective candidate set. In addition, a forensic tool has to deal with audio material from uncontrolled sources with variable properties and quality. In this work, we therefore attempt a major step towards practical forensic application scenarios. We propose a representation learning framework called EnvId, short for environment identification. EnvId avoids case-specific retraining by modeling the task as a few-shot classification problem. We demonstrate that EnvId can handle forensically challenging material. It provides good quality predictions even under unseen signal degradations, out-of-distribution reverberation characteristics or recording position mismatches.
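
The few-shot formulation mentioned in the TL;DR can be illustrated with a minimal sketch of prototype-based identification. It assumes that embeddings have already been computed by some feature extractor, that each candidate environment is represented by the mean of its $K$ reference embeddings, and that open-set rejection is a simple threshold on the Euclidean distance to the nearest prototype. The function names, the distance choice, and the threshold rule are illustrative assumptions, not details confirmed by the paper.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equally sized embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two embeddings (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_prototypes(support):
    """support: dict mapping environment label -> list of K reference embeddings."""
    return {label: mean_vector(embs) for label, embs in support.items()}


def identify(query, prototypes, reject_threshold=None):
    """Return (label, distance) of the nearest prototype.

    If reject_threshold is given and the nearest prototype is farther away,
    the label is None, i.e. the query matches no candidate (hypothetical rule).
    """
    label, dist = min(
        ((lbl, euclidean(query, proto)) for lbl, proto in prototypes.items()),
        key=lambda item: item[1],
    )
    if reject_threshold is not None and dist > reject_threshold:
        return None, dist
    return label, dist


# Toy 2-way 2-shot case with made-up 2-D embeddings.
support = {
    "room_a": [[0.0, 0.0], [0.2, 0.0]],
    "room_b": [[1.0, 1.0], [1.0, 1.2]],
}
prototypes = build_prototypes(support)
known = identify([0.1, 0.1], prototypes, reject_threshold=0.5)
unknown = identify([5.0, 5.0], prototypes, reject_threshold=0.5)
```

Because the candidate set enters only through the reference embeddings, a new case needs new reference recordings rather than a retrained classifier, which is the property the abstract emphasizes.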

Paper Structure (45 sections, 12 equations, 7 figures, 5 tables)

This paper contains 45 sections, 12 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Our end-to-end trainable EnvId framework for joint few-shot environment identification and blind parameter regression from audio recordings. The framework takes audio signals as inputs (a), and consists of a neural feature extractor (b) and projector (c) to process and map the input samples to the learnable, metric embedding space (d). The audio representations in the metric space can be used both for the identification of environments and for the regression of environmental parameters (e).
  • Figure 2: Proposed pipeline for controlled simulation of real-world audio recording and post-processing scenarios. Configurable sets of input signals, environments and degradations (orange) enable the creation of custom test cases. In $3$ steps (purple), anechoic audio signals $a(t)$ pass through various transformations and are output in frequency representation. Dashed arrows indicate skip connections to (randomly) enable and disable degradation transformations per sample.
  • Figure 3: Curves on our $4$ test sets of noisy and single-compressed reverberant speech from Sec. \ref{subsubsec:dataset_benchmark} for rejecting samples that do not match any reference recording location. The results are reported for the Gamper$^\star$, GamperCNN [gamper2018blind] and Götz [gotz2023contrastive] feature extractors.
  • Figure 4: Averaged Top-{1,2,3} accuracy over 5 training runs for few-shot environment identification under degradation factors unseen during training. We provide benchmarks for the Gamper$^\star$, GamperCNN [gamper2018blind] and Götz [gotz2023contrastive] backbones on the MIT [traer2016statistics] test set for multi-compression runs (Fig. \ref{subfig:multi_c}), high, mid and low quality settings of the unseen Vorbis (Fig. \ref{subfig:vorbis}) and neural EnCodec [defossez2022highfi] compression codecs (Fig. \ref{subfig:encodec}), and unseen real environmental background noise (Fig. \ref{subfig:noise}). Note that the y-axis is scaled differently for the individual degradation types to better highlight variations within each experiment.
  • Figure 5: Location identification accuracy w.r.t. the $K$-shot parameter for inference on the test sets from Sec. \ref{subsubsec:dataset_benchmark}. The out-of-distribution OPENAIR environments require the most $K$ reference samples.
  • ...and 2 more figures
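
The simulation idea in the Figure 2 caption can be sketched roughly as follows: an anechoic signal is convolved with a room impulse response, and a degradation is then applied or skipped at random per sample. This is a minimal sketch under stated assumptions, not the authors' pipeline: the white Gaussian noise model, the `p_noise` probability, and all function names are hypothetical, and the compression and frequency-transform steps are omitted.

```python
import math
import random


def convolve(signal, rir):
    """Full linear convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to a target SNR in dB (assumed noise model)."""
    power = sum(x * x for x in signal) / len(signal)
    noise_std = math.sqrt(power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def simulate(dry, rir, snr_db=20.0, p_noise=0.5, seed=0):
    """Reverberate a dry signal, then apply the noise step with probability p_noise."""
    rng = random.Random(seed)
    wet = convolve(dry, rir)
    # Per-sample skip connection: the degradation is randomly enabled or bypassed.
    if rng.random() < p_noise:
        wet = add_noise(wet, snr_db, rng)
    return wet


# Toy signals; a real impulse response would be far longer.
dry = [1.0, 2.0, 3.0]
rir = [1.0, 0.5]
clean_reverb = simulate(dry, rir, p_noise=0.0)
noisy_reverb = simulate(dry, rir, snr_db=10.0, p_noise=1.0)
```

Setting `p_noise` to 0 or 1 forces the degradation off or on, which is convenient for building test sets with a known configuration.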