DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization

Xiaodong Zhu, Suting Wang, Yuanming Zheng, Junqi Yang, Yangxu Liao, Yuhong Yang, Weiping Tu, Zhongyuan Wang

TL;DR

DeformTrace enhances SSMs with deformable dynamics and relay mechanisms to address the challenges of Temporal Forgery Localization, achieving state-of-the-art performance with fewer parameters, faster inference, and stronger robustness.

Abstract

Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments in video and audio, offering strong interpretability for security and forensics. While recent State Space Models (SSMs) show promise in precise temporal reasoning, their use in TFL is hindered by ambiguous boundaries, sparse forgeries, and limited long-range modeling. We propose DeformTrace, which enhances SSMs with deformable dynamics and relay mechanisms to address these challenges. Specifically, Deformable Self-SSM (DS-SSM) introduces dynamic receptive fields into SSMs for precise temporal localization. To further enhance its capacity for temporal reasoning and mitigate long-range decay, a Relay Token Mechanism is integrated into DS-SSM. In addition, Deformable Cross-SSM (DC-SSM) partitions the global state space into query-specific subspaces, reducing the accumulation of non-forgery information and boosting sensitivity to sparse forgeries. These components are integrated into a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs. Extensive experiments show that DeformTrace achieves state-of-the-art performance with fewer parameters, faster inference, and stronger robustness.

Paper Structure (18 sections, 6 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Overview of our main contributions (we take the video sequence for illustration). (a) Vanilla SSM; (b) Deformable Self-SSM with learnable temporal offsets for flexible local sampling; (c) Visualization of hidden attention, where relay tokens expand receptive fields and maintain long-range dependencies; (d) Deformable Cross-SSM enables cross-sequence interactions by allowing each query token to partition the global state space into subspaces.
  • Figure 2: Illustration of the overall scheme of DeformTrace. Built on TadTR [liu2022end], DeformTrace integrates a multi-scale audio-visual feature extraction module, a deformable encoder for temporal modeling, and a deformable decoder for forgery localization and video-level classification. By incorporating deformable self- and cross-SSM modules, it combines Mamba's efficient state updates with the Transformer's global modeling. Relay tokens with enhanced and cooperation losses help preserve long-range dependencies during self-scanning.
  • Figure 3: Robustness evaluation under various compression and degradation scenarios. The experiments include 6 visual distortions (Block-wise, Color Contrast, Gaussian Noise (video), Gaussian Blur, JPEG Compression and Video Compression) and 4 audio distortions (Gaussian Noise (audio), Reverberation, Pitch Shift and Audio Compression). In the figure, colors denote methods; solid lines show mAP across intensities, dashed lines show mAP on clean videos. DeformTrace achieves the highest mAP on clean videos and demonstrates strong robustness across various distortion scenarios.
  • Figure 4: (a) Ablation study on the number of relay tokens (avg. video length: 9s). (b) Performance vs. video duration.
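
To make the contrast in Figure 1(a) and 1(b) concrete, the following is a minimal sketch of a plain linear state-space scan next to a variant that first resamples the input at offset positions. It is not the authors' implementation: the scalar parameters `a`, `b`, `c`, the function names, the linear-interpolation sampling, and the use of fixed offsets passed in by the caller (rather than offsets learned by the network) are all assumptions made for illustration.

```python
import math
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Scalar linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def sample_linear(x: List[float], pos: float) -> float:
    """Linearly interpolate x at a fractional position, clamped to [0, len(x) - 1]."""
    pos = min(max(pos, 0.0), len(x) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(x) - 1)
    frac = pos - lo
    return (1.0 - frac) * x[lo] + frac * x[hi]


def deformable_ssm_scan(
    x: List[float], offsets: List[float], a: float, b: float, c: float
) -> List[float]:
    """Resample x at position t + offsets[t] for each step t, then run the same scan.

    Hypothetical stand-in for "learnable temporal offsets": here the offsets
    are plain inputs, whereas a trained model would predict them.
    """
    assert len(x) == len(offsets)
    resampled = [sample_linear(x, t + d) for t, d in enumerate(offsets)]
    return ssm_scan(resampled, a, b, c)


if __name__ == "__main__":
    signal = [0.0, 1.0, 0.0, 0.0]
    print(ssm_scan(signal, a=0.5, b=1.0, c=1.0))
    # A positive offset at step 0 lets the first state "look ahead" one frame.
    print(deformable_ssm_scan(signal, [1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
```

With all offsets set to zero the deformable variant reduces to the plain scan, which is a convenient sanity check when experimenting with this kind of sampling.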