CausalVE: Face Video Privacy Encryption via Causal Video Prediction

Yubo Huang, Wenhao Feng, Xin Lai, Zixi Wang, Jingzehua Xu, Shuai Zhang, Hongjie He, Fan Chen

TL;DR

CausalVE is a proposed neural network framework that provides good security in public video dissemination and outperforms state-of-the-art methods in qualitative, quantitative, and visual evaluations.

Abstract

Advanced facial recognition technologies and recommender systems, combined with inadequate privacy technologies and policies for facial interactions, raise concerns about bioprivacy violations. With the proliferation of video and live-streaming websites, the distribution of and interaction with public face videos pose greater privacy risks. Existing techniques typically address the leakage of sensitive biometric information through various privacy enhancement methods, but they introduce a higher security risk either by corrupting the information that the interaction data is meant to convey or by leaving certain biometric features intact, which allows an attacker to infer sensitive biometric information. To address these shortcomings, we propose a neural network framework, CausalVE. We obtain cover images by using a diffusion model to perform face swapping with face guidance, and we use the speech sequence features and spatiotemporal sequence features of the secret video for dynamic video inference and prediction, which yields a cover video with the same number of frames as the secret video. In addition, we hide the secret video with reversible neural networks, so that the disseminated video can also carry secret data. Extensive experiments show that CausalVE provides good security in public video dissemination and outperforms state-of-the-art methods in qualitative, quantitative, and visual evaluations.

Paper Structure (26 sections, 28 equations, 6 figures, 4 tables, 2 algorithms)

This paper contains 26 sections, 28 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: CausalVE Network Framework
  • Figure 2: Architecture of CausalVE for Cover Video Generation. After segmenting the initial video, CausalVE selects images for face-swapping at regular intervals. The Interaction Module then uses a single cover image to generate a complete cover video sequence, guided by the voice sequence from the initial video.
  • Figure 3: The CNN-ViST-CNN Framework for Video Prediction in our Re-prediction and Decision Module. We use CNNs as the encoder to extract spatial features and as the decoder for post-video frame prediction, and a Video Swin Transformer (ViST) as the translator to learn both temporal and spatial evolution.
  • Figure 4: Video Hiding and Recovery Framework. Given a secret video $x_{secret}$ and a cover video $x_{cover}$, forward hiding is applied frame by frame to generate a pseudo-video $x_{stego}$. Inversely, the same reversible neural network with the same parameters recovers the video $x_{recover}$ from the pseudo-video $x_{stego}$.
  • Figure 5: Overall comparison and steganalysis results.
  • ...and 1 more figures
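
The Figure 4 caption says that the same reversible network, with the same parameters, both hides and recovers the secret video. A minimal sketch of why this can work is given below, using a generic additive coupling step on a single flattened frame. This is not the paper's architecture, which is not specified in this section: the names `make_params`, `subnet`, `hide`, and `recover`, the random weights standing in for learned ones, and the choice to keep the residual output for recovery are all illustrative assumptions.

```python
import math
import random


def make_params(n, seed=0):
    """Hypothetical stand-in for learned weights: two small random n x n matrices."""
    rng = random.Random(seed)
    w_phi = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    w_eta = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    return w_phi, w_eta


def subnet(w, x):
    """An arbitrary sub-network (tanh of a linear map); it need not be invertible."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]


def hide(cover, secret, params):
    """Forward pass: mix the secret frame into the cover frame."""
    w_phi, w_eta = params
    stego = [c + p for c, p in zip(cover, subnet(w_phi, secret))]
    residual = [s + e for s, e in zip(secret, subnet(w_eta, stego))]
    return stego, residual


def recover(stego, residual, params):
    """Inverse pass: undo the two additions in reverse order with the same weights."""
    w_phi, w_eta = params
    secret = [r - e for r, e in zip(residual, subnet(w_eta, stego))]
    cover = [s - p for s, p in zip(stego, subnet(w_phi, secret))]
    return cover, secret


if __name__ == "__main__":
    params = make_params(4, seed=1)
    cover = [0.1, 0.2, 0.3, 0.4]    # toy "cover frame" pixels
    secret = [0.9, 0.8, 0.7, 0.6]   # toy "secret frame" pixels
    stego, residual = hide(cover, secret, params)
    cover_back, secret_back = recover(stego, residual, params)
    print(max(abs(a - b) for a, b in zip(secret, secret_back)))
```

Because each step only adds the output of a sub-network to one branch, the inverse subtracts the same quantity, so recovery is exact up to floating-point rounding regardless of what the sub-networks compute. A video would be processed by applying `hide` and `recover` to each frame in turn.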