Robust Lossy Audio Compression Identification

Hendrik Vincent Koops, Gianluca Micchi, Elio Quinton

TL;DR

This work questions the robustness of blind lossy audio compression identification, showing that models trained on fixed encoder configurations can achieve near-perfect accuracy but fail when codec parameters vary. It introduces a random spectrogram masking strategy to reduce reliance on the codec cutoff frequency, yielding a substantially more robust model that generalises across codecs and bitrates. While performance improves markedly, detection remains more challenging for AAC, suggesting codec-specific artefacts and dataset characteristics influence detectability. The findings have practical implications for quality assurance and archival workflows where codec configurations vary beyond training conditions.

Abstract

Previous research contributions on blind lossy compression identification report near-perfect performance metrics on their test sets, across a variety of codecs and bit rates. However, we show that such results can be deceptive and may not accurately represent the true ability of the system to tackle the task at hand. In this article, we present an investigation into the robustness and generalisation capability of a lossy audio identification model. Our contributions are as follows. (1) We show that a model equivalent to prior art lacks robustness to codec parameter variations. In particular, when naively training a lossy compression detection model on a dataset of music recordings processed with a range of codecs and their lossless counterparts, we obtain near-perfect performance metrics on the held-out test set, but severely degraded performance on lossy tracks produced with codec parameters not seen in training. (2) We propose an improved training strategy and show that it significantly increases the robustness and generalisation capability of the model beyond codec configurations seen during training. Namely, we apply a random mask to the input spectrogram to encourage the model not to rely solely on the training set's codec cutoff frequency.
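
The random masking idea described in contribution (2) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not specify how the mask is drawn, so the uniform choice of cutoff bin, the `min_keep_fraction` parameter, and the function name `random_lowpass_mask` are all assumptions made for illustration. The sketch zeroes every frequency bin above a randomly chosen cutoff, so the apparent cutoff frequency varies from one training example to the next.

```python
import random


def random_lowpass_mask(spec, min_keep_fraction=0.5, rng=None, fill=0.0):
    """Zero all frequency bins at or above a randomly drawn cutoff bin.

    spec: list of rows, one row per frequency bin (ordered low to high),
          each row a list of magnitudes, one per time frame.
    min_keep_fraction: hypothetical parameter; the lowest allowed cutoff,
          expressed as a fraction of the number of bins.
    Returns (masked_copy, cutoff_bin). The input is left unmodified.
    """
    if not spec:
        return [], 0
    rng = rng or random.Random()
    n_bins = len(spec)
    lowest = max(1, int(n_bins * min_keep_fraction))
    # randint is inclusive on both ends; cutoff == n_bins means "no masking".
    cutoff = rng.randint(lowest, n_bins)
    masked = [
        row[:] if i < cutoff else [fill] * len(row)
        for i, row in enumerate(spec)
    ]
    return masked, cutoff
```

In a real training pipeline the same operation would typically be applied to a tensor inside the model or the data loader, with a fresh cutoff drawn for each example, so that the cutoff position alone no longer separates lossy from lossless inputs.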

Paper Structure (22 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Basic block diagram of a perceptual audio coder. After spectral decomposition, a psychoacoustic model informs the quantization of individual spectral components.
  • Figure 2: Spectrograms of a lossless (left) and a lossy (right) version of the same audio excerpt. The latter is compressed with the libfdk_aac codec at a bit rate of 128 kbps. The version on the right shows the hallmarks of lossy compression: removal of FFT coefficients, holes in the spectrum, and general loss of higher-frequency content.
  • Figure 3: Proposed model for the detection of lossy audio. Our model takes as input 2 seconds of audio, which is passed to a torchaudio spectrogram layer (in green). Depending on the experiment, the spectrogram is then passed to a masking layer (in blue), which simulates low-pass filtering. The spectrogram is then passed to four convolutional modules (in pink). We use a bidirectional LSTM (in yellow) for dimensionality reduction. The final model head classifies the audio as lossy or lossless.
  • Figure 4: Saliency maps obtained by exposing a model trained without (top) and with (bottom) the random mask to lossy audio. The model trained with the random mask shows more activation in the holes of the spectrogram without losing any of the activations at the cutoff frequency.
  • Figure 5: F1-score for varying thresholds, evaluated on ds2. Each line analyses the subset made of lossless files as negatives and the specified codec as positives; files encoded with different codecs are discarded. Left: model without random mask; Right: model with random mask.
  • ...and 3 more figures
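
The per-codec evaluation described in the Figure 5 caption can be sketched as follows. This is an illustrative reconstruction from the caption only, not the authors' evaluation code: the record layout (dictionaries with `codec` and `score` keys), the `"lossless"` label string, and the function names are assumptions. For a given codec, the subset keeps lossless files as negatives and files from that codec as positives, discards every other codec, and computes F1 at each decision threshold.

```python
def f1_score(labels, scores, threshold):
    """F1 for binary labels, predicting positive when score >= threshold."""
    tp = fp = fn = 0
    for is_lossy, score in zip(labels, scores):
        predicted_lossy = score >= threshold
        if predicted_lossy and is_lossy:
            tp += 1
        elif predicted_lossy and not is_lossy:
            fp += 1
        elif not predicted_lossy and is_lossy:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_per_codec(records, codec, thresholds):
    """Sweep thresholds on the subset {lossless files} + {files from `codec`}.

    records: iterable of dicts with hypothetical keys "codec" and "score",
             where "score" is the model's predicted probability of "lossy".
    """
    subset = [r for r in records if r["codec"] in (codec, "lossless")]
    labels = [r["codec"] != "lossless" for r in subset]
    scores = [r["score"] for r in subset]
    return {t: f1_score(labels, scores, t) for t in thresholds}


# Toy data, invented purely to exercise the functions.
records = [
    {"codec": "lossless", "score": 0.1},
    {"codec": "lossless", "score": 0.6},
    {"codec": "aac", "score": 0.4},
    {"codec": "aac", "score": 0.9},
    {"codec": "mp3", "score": 0.95},  # discarded when evaluating "aac"
]
curve = f1_per_codec(records, "aac", [0.3, 0.5, 0.8])
```

Plotting one such curve per codec gives a view comparable to the one the caption describes, where a flatter curve indicates a model whose performance depends less on the chosen threshold.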