Point to the Hidden: Exposing Speech Audio Splicing via Signal Pointer Nets

Denise Moussa; Germans Hirsch; Sebastian Wankerl; Christian Riess

TL;DR

This paper tackles the problem of locating splice points in speech audio under unconstrained conditions by reframing localisation as a pointer task. It introduces SigPointer, a Transformer-based pointer network that operates on continuous input signals $\mathbf{S}$ and points to splice positions using a memory $\mathbf{H}$ and per-step vectors $\mathbf{z}_{t^{*}}$, producing index distributions $\hat{\mathbf{p}}_t$ and final positions $\hat{y}_t$. The model is trained with a cosine-distance loss $d_c$ and a three-stage curriculum on data generated from anechoic sources with post-processing, achieving robust performance even under compression and noise. Empirical results show SigPointer, especially in its optimized form SigPointer*, outperforms several baselines across exact and coarse localisation tasks, with improvements of roughly $6$ to $10$ percentage points in key metrics on challenging datasets, and maintains strong performance under out-of-distribution processing chains. The work demonstrates that pointer mechanisms can effectively localise audio splices in continuous signals with relatively small models, offering a practical tool for forensic analysis and digital integrity verification.
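The pointing step summarised above can be illustrated with a small sketch. It is not the authors' implementation: the summary only names the memory $\mathbf{H}$, the per-step vector $\mathbf{z}_{t^{*}}$, the index distribution $\hat{\mathbf{p}}_t$, the position $\hat{y}_t$ and the cosine-distance loss $d_c$. Dot-product scoring, the softmax normalisation, the one-hot target and the helper names `softmax`, `point` and `cosine_distance` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def point(memory, z):
    """Score every memory row against z and return (p_hat, y_hat).

    Assumption: scores are plain dot products; the paper may use a
    different scoring function.
    """
    scores = [sum(h_i * z_i for h_i, z_i in zip(h, z)) for h in memory]
    p_hat = softmax(scores)  # distribution over input positions
    y_hat = max(range(len(p_hat)), key=p_hat.__getitem__)  # argmax index
    return p_hat, y_hat


def cosine_distance(p, q):
    """Cosine distance, defined here as 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


# Toy memory with three positions and a two-dimensional feature space.
H = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
z = [0.0, 2.0]
p_hat, y_hat = point(H, z)

# Assumption: the training target is a one-hot vector at the true position.
target = [0.0, 1.0, 0.0]
loss = cosine_distance(p_hat, target)
```

In this toy case the second memory row aligns best with `z`, so the pointer selects index 1, and the loss shrinks as `p_hat` concentrates on the target position.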

Abstract

Verifying the integrity of voice recording evidence for criminal investigations is an integral part of an audio forensic analyst's work. Here, one focus is on detecting deletion or insertion operations, so-called audio splicing. While this is a rather simple way to alter spoken statements, careful editing can yield quite convincing results. For difficult cases or large amounts of data, automated tools can support analysts in detecting potential editing locations. To this end, several analytical and deep learning methods have been proposed. Still, few address unconstrained splicing scenarios as expected in practice. With SigPointer, we propose a pointer network framework for continuous input that uncovers splice locations naturally and more efficiently than existing works. Extensive experiments on forensically challenging data such as strongly compressed and noisy signals quantify the benefit of the pointer mechanism, with performance increases of about 6 to 10 percentage points.

Paper Structure (15 sections, 2 figures, 1 table)

Introduction
Existing Approaches to Audio Splicing Localisation
Audio Splicing Localisation via Pointer Mechanisms
Proposed Method
SigPointer for Continuous Input Signals
Choosing Model and Training Parameters
Datasets and Training Strategy
Experiments
Baseline Models
Performance Metrics
Evaluation of the Pointer Mechanism
Influence of Splices per Input
Robustness to Complex Processing Chains
Limitations
Conclusions

Figures (2)

Figure 1: SigPointer model for locating splices in audio signals
Figure 2: Jaccard index $J$ (mean of 5 training runs) for $n \in [0,5]$ splices (Fig. \ref{subfig:tts-per-splice-a}-\ref{subfig:tts-per-splice-d}) and robustness towards out-of-distribution multi-compression and real noise post-processing (Fig. \ref{subfig:compr_bin1}-\ref{subfig:noise_bin4}). The pointer framework (red tones) is clearly superior in all tests.
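
For readers unfamiliar with the metric in Figure 2, the Jaccard index between a predicted and a reference set of splice positions is the size of their intersection divided by the size of their union. The sketch below is illustrative only: the function name `jaccard_index` is hypothetical, and returning 1.0 when both sets are empty (the $n = 0$ case) is an assumption, since this page does not state how the paper scores that case.

```python
def jaccard_index(predicted, reference):
    """Jaccard index |P ∩ R| / |P ∪ R| between two sets of positions."""
    pred, ref = set(predicted), set(reference)
    union = pred | ref
    if not union:
        # Assumption: no predicted and no true splices counts as a perfect match.
        return 1.0
    return len(pred & ref) / len(union)


# One of two predicted positions is correct; the union has three positions.
score = jaccard_index([3, 7], [3, 9])
```

Because the comparison is set-based, the order in which positions are predicted does not affect the score.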
