ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models

Yingxin Lai; Zitong Yu; Jun Wang; Linlin Shen; Yong Xu; Xiaochun Cao

Abstract

Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven: they retain salient objects while discarding background regions, where manipulation traces such as high-frequency anomalies and temporal jitter often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, which quantifies the physical discontinuities that indicate transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves $2.97\times$ speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
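The Birth-Death transport idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's formulation: the function names, the cosine cost, the unit token masses, the dummy-node masses, and the default values of `penalty`, `eps`, and `iters` are all hypothetical. The sketch appends one dummy row and one dummy column to the cost matrix between the tokens of two adjacent frames, runs plain Sinkhorn iterations, and reads the mass that each current-frame token receives from the dummy row as its "birth" novelty.

```python
import math


def cosine_cost(a, b):
    """Cost in [0, 2]: 1 minus the cosine similarity of two token vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def birth_death_novelty(prev_tokens, cur_tokens, penalty=0.5, eps=0.1, iters=200):
    """Return one novelty score in [0, 1] per current-frame token.

    Hypothetical setup: every real token carries unit mass, the dummy row
    (birth source) carries mass m, and the dummy column (death sink)
    carries mass n, so both marginals sum to n + m.
    """
    n, m = len(prev_tokens), len(cur_tokens)

    # Augmented (n + 1) x (m + 1) cost matrix: real-to-real costs, a constant
    # penalty for matching a real token to a dummy node, zero dummy-to-dummy.
    cost = [[cosine_cost(p, q) for q in cur_tokens] + [penalty] for p in prev_tokens]
    cost.append([penalty] * m + [0.0])

    row_mass = [1.0] * n + [float(m)]
    col_mass = [1.0] * m + [float(n)]

    # Gibbs kernel of the entropic problem.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]

    # Sinkhorn scaling: alternately match the row and column marginals.
    u = [1.0] * (n + 1)
    v = [1.0] * (m + 1)
    for _ in range(iters):
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m + 1))
             for i in range(n + 1)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n + 1))
             for j in range(m + 1)]

    # Mass sent from the dummy row to each real current token ("birth" mass).
    return [u[n] * kernel[n][j] * v[j] for j in range(m)]


if __name__ == "__main__":
    prev_frame = [[1.0, 0.0], [0.0, 1.0]]
    cur_frame = [[1.0, 0.0], [-1.0, -1.0]]  # second token has no good match
    print(birth_death_novelty(prev_frame, cur_frame))
```

In this toy input the first current token has an exact match in the previous frame, so almost none of its mass comes from the dummy row, while the second token is cheaper to "create" than to match and therefore receives a score close to 1. A pruning step could then keep the highest-scoring tokens; how such a score would be combined with high-frequency priors is not covered by this sketch.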

Paper Structure (21 sections, 9 equations, 4 figures, 6 tables)

Introduction
Related Work
Synthetic Image Detection via MLLMs.
Visual Token Compression for Efficient MLLM Inference.
ForensicZip
Preliminary
Forensic VLM pipeline.
Sequence-length complexity.
The Forensic-Semantic Discrepancy
Obs. 1: Inverse correlation between semantic saliency and forensic evidence.
Obs. 2: Temporal discontinuity as Birth-Death events.
Methodology
Transport Novelty Estimation (TNE)
Birth-Death entropic optimal transport.
Per-token score extraction.
...and 6 more sections

Figures (4)

Figure 1: Quantitative analysis of forensic-semantic misalignment. (a) Recall of retained tokens against ground-truth forgery regions under varying compression ratios. Semantic-driven criteria show high sensitivity to compression ratios, while our transport-based novelty score maintains stable coverage of localized anomalies. Shaded regions denote $\pm 1$ standard deviation across evaluation samples. (b) Distribution of temporal Optimal Transport (OT) costs. The distinct high-cost tail in forged videos indicates physical discontinuities (birth-death events), supporting the use of our augmented OT formulation.
Figure 2: Empirical motivation for ForensicZip. (a) Inverse Correlation: Density plot of semantic saliency (cross-modal attention) vs. forensic evidence (IoU with forgery masks) on FakeClue/LOKI. (b) Temporal Discontinuity: KDE of inter-frame matching costs.
Figure 3: Overall framework of ForensicZip. ForensicZip measures transport novelty across adjacent video frames and selectively preserves vision tokens with high physical inconsistency (i.e., transient artifacts), thereby achieving plug-and-play forensic MLLM inference acceleration.
Figure 4: Ablation Study. Impact of (a) joint weights $\lambda/\eta$, (b) regularization $\varepsilon_{\mathrm{ot}}$, (c) penalty $c$, and (d) iterations on FakeClue accuracy.
