GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement

Xiaodong Zhu, Yuanming Zheng, Suting Wang, Junqi Yang, Yuhong Yang, Weiping Tu, Zhongyuan Wang

TL;DR

GEM-TFL (Graph-based EM-powered Temporal Forgery Localization) is a two-phase classification-regression framework that bridges the supervision gap between training and inference in weakly supervised temporal forgery localization.

Abstract

Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.
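
To make the gradient-blockage issue concrete, the sketch below shows generic top-k mean pooling of the kind commonly used to aggregate frame scores into a video-level score. It is an illustration only and is not taken from the paper: the function name `topk_mean_with_grad` and the choice of a plain mean over the k highest frame scores are assumptions. The point is that the derivative of the video-level score is zero for every frame outside the selected top k, so those frames receive no learning signal.

```python
def topk_mean_with_grad(frame_scores, k):
    """Return the mean of the k largest scores and its gradient w.r.t. each score."""
    # Indices of the k highest-scoring frames (a hard, non-differentiable selection).
    top = sorted(range(len(frame_scores)),
                 key=lambda i: frame_scores[i], reverse=True)[:k]
    video_score = sum(frame_scores[i] for i in top) / k
    # d(video_score)/d(frame_scores[i]) is 1/k for selected frames and 0 otherwise.
    grad = [0.0] * len(frame_scores)
    for i in top:
        grad[i] = 1.0 / k
    return video_score, grad


# Hypothetical frame-level forgery scores for a four-frame clip.
scores = [0.1, 0.9, 0.4, 0.8]
video_score, grad = topk_mean_with_grad(scores, k=2)
print(video_score, grad)
```

Only the second and fourth frames carry a non-zero gradient here; the selection itself contributes no gradient, which is the blockage the abstract refers to.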

Paper Structure (15 sections, 16 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Comparison between prior and our WS-TFL pipelines. (a) Prior methods [xu2025multimodal; xu2025weaklysupervisedmultimodaltemporal]: trained with clip-level labels but required to perform temporal localization at inference, leading to noisy proposals due to the training–inference mismatch. (b) Ours: a two-phase design aligning classification and regression to refine coarse predictions for precise boundary localization.
  • Figure 2: Overview of the classification phase. (a) Label Attribute Decomposition: The feature enhancement module aligns and fuses audio–visual features, after which the attention and attribute branches produce frame-level attention and attribute predictions optimized through EM to capture diverse forgery patterns. (b) Temporal Consistency Refinement: The non-differentiable top-$k$ operation blocks gradient flow between the attention and attribute branches, causing inconsistent temporal responses. To address this, frame-level attribute predictions are alternately projected onto the row constraint (attention-weighted alignment between frame- and clip-level predictions) and the column constraint (categorical distribution). The refined attribute predictions are then used to generate initial pseudo proposals. (c) Graph-based Proposal Refinement: Proposals are mapped into a unified space, where a proposal graph is constructed and confidence values are diffused to obtain fusion weights. These are integrated and thresholded at zero to produce the final pseudo proposals, merging fragmented proposals (e.g., $p_1$, $p_2$, $p_3$) into continuous ones (e.g., $p_{\text{fuse}}$) from a global and relational perspective.
  • Figure 3: Comparison between prior and our WS-TFL pipelines.
  • Figure 4: t-SNE visualizations of classification features under different latent forgery attribute dimensions $m$. Overlaps between attributes are highlighted in gray. “binary” denotes the setting without EM optimization, where the model directly performs binary classification.
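
The alternating projection described in the Figure 2(b) caption can be illustrated with a Sinkhorn-style rescaling. This is a sketch under stated assumptions, not the authors' procedure: the function name `refine_attributes`, the multiplicative form of each projection, and the exact constraints are one reading of the caption. Here the alignment constraint is taken to mean that, for each attribute, the attention-weighted sum of frame-level predictions equals the clip-level prediction, and the categorical constraint is taken to mean that each frame's attribute scores sum to one. Which of these the paper calls the row constraint and which the column constraint is left open.

```python
def refine_attributes(frame_pred, attention, clip_pred, iters=200):
    """Alternately rescale frame-level attribute predictions (T x m) so that
    (1) the attention-weighted sum over frames matches clip_pred per attribute,
    (2) every frame's attribute scores form a categorical distribution.
    Assumes strictly positive inputs, with attention and clip_pred each summing to 1."""
    T, m = len(frame_pred), len(frame_pred[0])
    P = [list(row) for row in frame_pred]  # work on a copy
    for _ in range(iters):
        # Alignment step: scale each attribute so its attention-weighted
        # sum over frames equals the clip-level value.
        for j in range(m):
            agg = sum(attention[t] * P[t][j] for t in range(T))
            for t in range(T):
                P[t][j] *= clip_pred[j] / agg
        # Categorical step: renormalize each frame's scores to sum to 1.
        for t in range(T):
            total = sum(P[t])
            P[t] = [v / total for v in P[t]]
    return P


# Hypothetical inputs: 4 frames, m = 2 latent attributes.
frame_pred = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]
attention = [0.1, 0.4, 0.4, 0.1]
clip_pred = [0.3, 0.7]
refined = refine_attributes(frame_pred, attention, clip_pred)
```

No parameters are learned in this loop, which is consistent with the "training-free" description in the abstract; the refined matrix simply reconciles frame-level predictions with the attention weights and the clip-level prediction.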