Efficient Temporally-Aware DeepFake Detection using H.264 Motion Vectors
Peter Grönquist, Yufan Ren, Qingyi He, Alessio Verardo, Sabine Süsstrunk
TL;DR
DeepFakes pose security and privacy risks by manipulating facial identity and expression, yet most detectors process frames individually and therefore ignore temporal cues. This paper argues for temporally-aware detection that uses readily available H.264 motion vectors (MVs) and Information Masks (IMs) as cheap motion proxies, avoiding explicit optical-flow estimation. It introduces a MobileNet-based two-stream architecture that ingests MV/IM data (with optional RGB). On FaceForensics++, the MV/IM models generalize well and reach competitive accuracy, outperforming a RAFT-based optical-flow baseline in many settings while requiring far less computation. The approach enables near real-time temporal anomaly detection for video calls and streaming, and points to hardware-accelerated decoding as a way to reduce latency further. Noted limitations include the coarse spatial resolution of MVs and the need for richer, higher-quality datasets for robust cross-forgery evaluation.
Abstract
Video DeepFakes are fake media created with Deep Learning (DL) that manipulate a person's expression or identity. Most current DeepFake detection methods analyze each frame independently, ignoring inconsistencies and unnatural movements between frames. Some newer methods employ optical flow models to capture this temporal aspect, but they are computationally expensive. In contrast, we propose using the related but often ignored Motion Vectors (MVs) and Information Masks (IMs) from the H.264 video codec to detect temporal inconsistencies in DeepFakes. Our experiments show that this approach is effective and has minimal computational cost compared with per-frame RGB-only methods. This could lead to new, real-time temporally-aware DeepFake detection methods for video calls and streaming.
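
To illustrate why codec motion vectors are a cheap motion proxy, the following is a minimal sketch of turning block-level MVs into a dense per-pixel input that a network could consume. It is not the authors' pipeline: the `Block` record layout, the function name `rasterize_motion_vectors`, and the coverage mask (used here only as a simple stand-in for an Information Mask, whose exact definition is not given in this section) are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical record for one decoded block (an assumption, not a codec API):
# (x, y, w, h, dx, dy) = top-left corner, block size, displacement in pixels.
Block = Tuple[int, int, int, int, float, float]


def rasterize_motion_vectors(
    blocks: List[Block], width: int, height: int
) -> Tuple[List[List[Tuple[float, float]]], List[List[int]]]:
    """Spread block-level motion vectors over a dense width x height grid.

    Returns (flow, mask): flow[row][col] is the (dx, dy) of the block
    covering that pixel, and mask[row][col] is 1 where some block supplied
    a vector and 0 elsewhere.
    """
    flow = [[(0.0, 0.0) for _ in range(width)] for _ in range(height)]
    mask = [[0 for _ in range(width)] for _ in range(height)]
    for x, y, w, h, dx, dy in blocks:
        # Clip each block to the frame so partially outside blocks are safe.
        for row in range(max(y, 0), min(y + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                flow[row][col] = (dx, dy)
                mask[row][col] = 1
    return flow, mask


if __name__ == "__main__":
    # Two illustrative 4x4 blocks on an 8x8 frame (made-up values).
    demo_blocks = [(0, 0, 4, 4, 1.5, -0.5), (4, 4, 4, 4, -2.0, 0.0)]
    flow, mask = rasterize_motion_vectors(demo_blocks, width=8, height=8)
    print("covered pixels:", sum(sum(row) for row in mask))
```

The sketch involves no motion estimation at all: the vectors are assumed to come from the decoder as a by-product of decompression, which is the source of the computational saving relative to optical flow. Every pixel in a block shares one vector, so the resulting field is only as fine as the block size.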
