RID-TWIN: An end-to-end pipeline for automatic face de-identification in videos

Anirban Mukherjee, Monjoy Narayan Choudhury, Dinesh Babu Jayagopi

TL;DR

A novel pipeline is proposed that leverages state-of-the-art generative models and decouples identity from motion to perform automatic face de-identification in videos. Its performance is evaluated on the widely used VoxCeleb2 dataset.

Abstract

Face de-identification in videos is a challenging computer vision task, used primarily in privacy-preserving applications. Despite the considerable progress achieved through generative vision models, the latest approaches still face multiple challenges: they lack a comprehensive discussion and evaluation of aspects such as realism, temporal coherence, and preservation of non-identifiable features. In our work, we propose RID-Twin: a novel pipeline that leverages state-of-the-art generative models and decouples identity from motion to perform automatic face de-identification in videos. We investigate the task from a holistic point of view and discuss how our approach addresses the pertinent existing challenges in this domain. We evaluate the performance of our methodology on the widely used VoxCeleb2 dataset, and also on a custom dataset designed to cover certain behavioral variations that are absent from VoxCeleb2. We discuss the implications and advantages of our work and suggest directions for future research.

Paper Structure (10 sections, 4 figures, 1 table, 1 algorithm)

Figures (4)

  • Figure 1: Sample outputs of RID-Twin: The images from source videos (bottom) and their corresponding generated D-Twins (top)
  • Figure 2: Pipeline of RID-Twin: From an input video $V_S$, we first extract a source image $I_S$. From here, the FaceDetector module detects a face to create a face mask $I_M$. The source image also goes to the ImageCaptioning module to generate a relevant caption $C_S$. The source image, mask, and caption go to the Inpaint module to generate the same image with a new identity $I_D$, i.e. our D-Twin. Finally, we re-enact this D-Twin image based on the motion of our input video using the Re-enact module, to get the output de-identified video $V_D$.
  • Figure 3: Preservation of expression and pose across frames: (From top) De-identified video, Source Video, and Plots of Evaluation metrics: De-identification Level, Identity Consistency, Expression Preservation using Cosine Distance
  • Figure 4: RID-Twin on proposed custom dataset. In this figure, we see two examples of facial expression variation, the top row showing an instance of looking left, and the bottom row showing an instance of the expression 'shocked'. From left, we have frames from the first user's source video, the corresponding de-identified video, the second user's source video, and the corresponding de-identified video.
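
The data flow described in the Figure 2 caption can be sketched as a simple composition of four stages. This is only an illustrative outline, not the authors' implementation: the function name `rid_twin`, the callable parameters, and the `source_index` argument are hypothetical, and the caption does not say which frame is chosen as the source image, so picking the first frame by default is an assumption. The concrete models behind each module are left as pluggable callables.

```python
from typing import Any, Callable, List, Sequence

Frame = Any  # placeholder type for an image/frame (assumption)


def rid_twin(
    video_frames: Sequence[Frame],                    # V_S: input video as frames
    detect_face: Callable[[Frame], Any],              # FaceDetector: I_S -> I_M
    caption_image: Callable[[Frame], str],            # ImageCaptioning: I_S -> C_S
    inpaint: Callable[[Frame, Any, str], Frame],      # Inpaint: (I_S, I_M, C_S) -> I_D
    reenact: Callable[[Frame, Sequence[Frame]], List[Frame]],  # Re-enact: (I_D, V_S) -> V_D
    source_index: int = 0,                            # assumption: which frame is I_S
) -> List[Frame]:
    """Hypothetical sketch of the Figure 2 data flow; all stages are injected."""
    if not video_frames:
        raise ValueError("input video has no frames")

    i_s = video_frames[source_index]   # source image I_S extracted from V_S
    i_m = detect_face(i_s)             # face mask I_M
    c_s = caption_image(i_s)           # caption C_S describing the source image
    i_d = inpaint(i_s, i_m, c_s)       # D-Twin I_D: same image, new identity
    # Drive the D-Twin with the motion of the original video to obtain V_D.
    return reenact(i_d, video_frames)
```

Because identity is replaced once (on the single source image) and motion is taken from the input video afterwards, the sketch reflects the decoupling of identity from motion that the abstract describes.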