Efficient Temporally-Aware DeepFake Detection using H.264 Motion Vectors

Peter Grönquist, Yufan Ren, Qingyi He, Alessio Verardo, Sabine Süsstrunk

TL;DR

DeepFakes pose security and privacy risks by manipulating facial identity and expression, yet most detectors process frames individually and therefore ignore temporal cues. This paper argues for temporally aware detection that uses readily available H.264 motion vectors (MVs) and Information Masks (IMs) as cheap motion proxies, avoiding explicit optical-flow estimation. It introduces a MobileNet-based two-stream architecture that ingests MV/IM data (optionally with RGB) and reports on FaceForensics++ that MV/IM models generalize well and reach competitive accuracy, outperforming a RAFT-based optical-flow baseline in many settings while requiring far less computation. The approach could enable near real-time temporal anomaly detection for video calls and streaming, and it points to hardware-accelerated decoding as a way to reduce latency further. Noted limitations include the spatial resolution of MVs and the need for richer, higher-quality datasets for robust cross-forgery evaluation.

Abstract

Video DeepFakes are fake media created with Deep Learning (DL) that manipulate a person's expression or identity. Most current DeepFake detection methods analyze each frame independently, ignoring inconsistencies and unnatural movements between frames. Some newer methods employ optical flow models to capture this temporal aspect, but they are computationally expensive. In contrast, we propose using the related but often ignored Motion Vectors (MVs) and Information Masks (IMs) from the H.264 video codec, to detect temporal inconsistencies in DeepFakes. Our experiments show that this approach is effective and has minimal computational costs, compared with per-frame RGB-only methods. This could lead to new, real-time temporally-aware DeepFake detection methods for video calls and streaming.
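
To make the role of MVs and IMs more concrete, the following minimal Python sketch shows one way block-level motion vectors could be expanded into a dense per-pixel field, together with an information mask that marks where a vector is available. The `Block` record layout and the function name `rasterize_motion_vectors` are hypothetical and not taken from the paper; the sketch also assumes the vectors have already been extracted from the H.264 stream by some other tool.

```python
from typing import List, Tuple

# Hypothetical block record (an assumption, not the paper's format):
# (x, y, width, height, dx, dy), all in pixels.
Block = Tuple[int, int, int, int, float, float]


def rasterize_motion_vectors(blocks: List[Block], height: int, width: int):
    """Expand block-level motion vectors into a dense field and an information mask.

    Returns (field, mask): field[row][col] is a (dx, dy) tuple, and
    mask[row][col] is 1 where some block supplied a vector, else 0.
    """
    field = [[(0.0, 0.0) for _ in range(width)] for _ in range(height)]
    mask = [[0 for _ in range(width)] for _ in range(height)]
    for x, y, w, h, dx, dy in blocks:
        # Clip each block to the frame so partial blocks at the border are safe.
        for row in range(max(y, 0), min(y + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                field[row][col] = (dx, dy)
                mask[row][col] = 1
    return field, mask


# Toy example: one 16x16 block and one 8x8 block in a 32x16 frame.
blocks = [(0, 0, 16, 16, 2.0, -1.0), (16, 0, 8, 8, -0.5, 0.0)]
field, mask = rasterize_motion_vectors(blocks, height=16, width=32)

# Fraction of pixels for which a motion vector is available.
coverage = sum(map(sum, mask)) / (16 * 32)
```

Pixels without a vector keep a zero vector and a mask value of 0, so a downstream model can tell "no motion" apart from "no information", which is the distinction the information mask is meant to carry.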

Paper Structure (1 section, 1 figure)


Table of Contents

  1. Introduction

Figures (1)

  • Figure 1: Two consecutive video frames from the FaceForensics++ dataset, with their corresponding optical flow and H.264 motion vectors, each computed with respect to the preceding frame. The information masks indicate whether motion vectors are available at a spatial location. (1) and (2) show the two-dimensional motion information visualized as colors in RGB space. Motion vectors show motion similar to the optical flow but are coarser and noisier. (Best viewed on a screen when zoomed in.)
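
The Figure 1 caption mentions visualizing two-dimensional motion information in RGB space. The paper's exact color mapping is not given here, so the sketch below uses a convention common in optical-flow visualization as an assumption: direction is mapped to hue and magnitude to saturation. The function name `motion_to_rgb` and the `max_magnitude` parameter are hypothetical.

```python
import colorsys
import math


def motion_to_rgb(dx: float, dy: float, max_magnitude: float):
    """Map one 2-D motion vector to an (r, g, b) tuple with values in 0..255.

    Assumed convention (not confirmed by the paper): hue encodes direction,
    saturation encodes magnitude relative to max_magnitude, value is fixed at 1.
    """
    magnitude = math.hypot(dx, dy)
    # Angle in [0, 2*pi) rescaled to a hue in [0, 1).
    hue = (math.atan2(dy, dx) % (2 * math.pi)) / (2 * math.pi)
    # Clamp so vectors longer than max_magnitude saturate instead of overflowing.
    saturation = min(magnitude / max_magnitude, 1.0) if max_magnitude > 0 else 0.0
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return tuple(round(255 * c) for c in (r, g, b))


still = motion_to_rgb(0.0, 0.0, 4.0)   # no motion maps to white
right = motion_to_rgb(4.0, 0.0, 4.0)   # full-strength motion along +x
left = motion_to_rgb(-4.0, 0.0, 4.0)   # opposite direction, opposite hue
```

With this mapping, static regions appear white and opposite directions receive complementary colors, which makes coarse or noisy motion-vector blocks easy to spot next to a smoother optical-flow field.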