A Noval Feature via Color Quantisation for Fake Audio Detection

Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Yukun Liu, Guanjun Li, Xin Qi, Yi Lu, Xuefei Liu, Yongwei Li

TL;DR

The paper tackles the interpretability gap in reconstruction-based fake audio detection by introducing a recolor network that enforces color quantisation on spectral image-like inputs. It presents a two-module approach comprising a Pixel Mapping Module with a UNeXt encoder for per-pixel classification and a Palette Module to derive a compact color palette, enabling colored, interpretable reconstructions. Empirical results on the ASVspoof2019 Logical Access dataset show that color-quantised recolor features can improve detection performance over original spectral inputs, and pretraining the recolor network further enhances results. The work offers a practical, interpretable feature extraction pathway for FAD that can be integrated with existing backbones and training pipelines.

Abstract

In the field of deepfake detection, previous studies have focused on pre-training models with reconstruction or mask-and-predict objectives, such as wav2vec2.0 and the Masked Autoencoder, and then transferring them to fake audio detection, where the encoder is used to extract features. These methods have shown that reconstruction pre-training on real audio helps the model distinguish fake audio. Their disadvantage is poor interpretability: it is hard to present the differences between deepfake and real audio intuitively. This paper proposes a novel feature extraction method via color quantisation, which constrains the reconstruction to use a limited number of colors for the spectral image-like input. The proposed method ensures that the reconstructed input differs from the original, which allows intuitive observation of the focus areas in the spectral reconstruction. Experiments on the ASVspoof2019 dataset demonstrate that the proposed method achieves better classification performance than using the original spectral features as input, and that pretraining the recolor network also benefits fake audio detection.
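
To make the idea of reconstructing a spectral image from a limited number of colors concrete, the following is a minimal NumPy sketch. It is not the authors' implementation. It assumes that a pixel-mapping network has already produced per-pixel logits over K color classes (replaced here by random numbers), that a temperature scales those logits before the softmax, and that each palette color is the probability-weighted mean of the input pixels. The names `recolor`, `softmax`, `logits`, and `temperature` are illustrative only.

```python
import numpy as np


def softmax(logits, temperature=1.0, axis=-1):
    """Numerically stable softmax; lower temperature gives sharper assignments."""
    z = logits / temperature
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def recolor(image, logits, temperature=1.0):
    """Quantise an (H, W, C) image to K colors given (H, W, K) per-pixel logits.

    Assumption: the palette is the probability-weighted mean of the input
    pixels for each color class. The paper's Palette Module may differ.
    """
    h, w, c = image.shape
    k = logits.shape[-1]
    probs = softmax(logits, temperature)        # (H, W, K) soft class assignment
    flat_p = probs.reshape(-1, k)               # (H*W, K)
    flat_x = image.reshape(-1, c)               # (H*W, C)
    weights = flat_p.sum(axis=0)                # (K,) total mass per color
    palette = (flat_p.T @ flat_x) / (weights[:, None] + 1e-8)  # (K, C)
    index_map = probs.argmax(axis=-1)           # (H, W) hard color index per pixel
    quantised = palette[index_map]              # (H, W, C) recolored image
    return quantised, palette, index_map


# Toy input: a random 8x16 three-channel "spectrogram image" and random logits
# standing in for the output of a hypothetical pixel-mapping network.
rng = np.random.default_rng(0)
image = rng.random((8, 16, 3))
logits = rng.normal(size=(8, 16, 4))            # K = 4 colors
quantised, palette, index_map = recolor(image, logits, temperature=0.5)
n_colors = np.unique(quantised.reshape(-1, 3), axis=0).shape[0]
```

In this sketch the recolored image contains at most K distinct colors, so it necessarily differs from the input, and the `index_map` shows which regions were grouped under the same color.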

Paper Structure (19 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Overall architecture of the proposed method. In (a), we show how the reconstruction stage works, along with the training and inference stages for the FAD task. In (b), we show the details of the Palette Acquisition Module.
  • Figure 2: Reconstruction results of the pretrained recolor models for different time segments of the same sample from the VCTK dataset, with varying numbers of colors and temperature settings.