How Video Meetings Change Your Expression

Sumit Sarin, Utkarsh Mall, Purva Tendulkar, Carl Vondrick

TL;DR

This work introduces FacET, a generative, interpretable framework for uncovering spatio-temporal differences in facial expressions between two domains, video calls (VC) and face-to-face (F2F), using unpaired video data. It combines $\beta$-VAE-based spatial disentanglement with a translation function that applies per-chunk shift and scale transformations to latent codes, guided by an adversarial objective to capture diverse domain differences. The approach produces detailed, input-dependent reports of how expressions differ, supports unsupervised temporal change-point discovery, and can perform domain transfer to generate "de-zoomified" videos that resemble F2F interactions. Experiments on the ZoomIn and presidential datasets provide both qualitative and quantitative insights, and the method offers a practical tool for analyzing and improving virtual communication experiences.

Abstract

Do our facial expressions change when we speak over video calls? Given two unpaired sets of videos of people, we seek to automatically find spatio-temporal patterns that are distinctive of each set. Existing methods use discriminative approaches and perform post-hoc explainability analysis. Such methods are insufficient: they cannot provide insights beyond obvious dataset biases, and their explanations are useful only if humans themselves are good at the task. Instead, we tackle the problem through the lens of generative domain translation: our method generates a detailed report of learned, input-dependent spatio-temporal features and the extent to which they vary between the domains. We demonstrate that our method can discover behavioral differences between conversing face-to-face (F2F) and on video calls (VCs). We also show the applicability of our method to discovering differences in presidential communication styles. Additionally, we are able to predict, in an unsupervised way, temporal change-points in videos that decouple expressions, which increases the interpretability and usefulness of our model. Finally, our method, being generative, can be used to transform a video call to appear as if it were recorded in an F2F setting. Experiments and visualizations show that our approach is able to discover a range of behaviors, taking a step towards a deeper understanding of human behavior.

Paper Structure

This paper contains 22 sections, 7 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: What is the difference between the two domains? Given two unpaired sets of videos of persons speaking on VC (left) and F2F (right), our goal is to provide interpretable insights on how the motion sequence differs between the domains. See Figs. \ref{fig:results-faces} & \ref{fig:main} for a detailed report generated by our approach.
  • Figure 2: Insufficiency of discriminative methods. We train a simple linear classifier on disentangled facial features to distinguish VC and F2F conversations (refer to Fig. \ref{fig:results-faces} for the meanings of these features). We observe 88% classification accuracy even when using frames without the temporal information, with 'Head Pitch' (#1) and 'Head Tilt' (#10) being the most dominant features. A post-hoc explainability model cannot explain this discriminative model as it is trained on biased data.
  • Figure 3: Explanation of Disentangled Latents. We vary each dimension of the 12-dimensional latent obtained through $\beta$-VAE encoding while keeping the other dimensions fixed (rows). Left: The faces corresponding to the extreme values of the perturbed latent, along with a description of the dominant change. Right: Dataset-level statistics for the desired latent, while the dotted line shows the modes. The direction and length of the arrows show the extent to which different latents change across domains on average. Please refer to the supplementary for videos visualizing the latents.
  • Figure 4: Method Overview. Given a sequence of facial keypoints $x$, we use a pre-trained $\beta$-VAE encoder to obtain latents $z$. We then train a translation function $G_{XY}$ that takes as input the latents $z$ to produce a translator ($\omega$, $\phi$, $\tau$). This translator, when applied to $z$, generates the transformed latent $z'$, which can be decoded using the $\beta$-VAE decoder to obtain the transformed facial keypoints $y'$ belonging to the new domain.
  • Figure 5: Results. We showcase our key findings of domain differences through a detailed report. Left: Each row corresponds to two different examples belonging to a specific translator cluster (e.g., "Speaking with a smile"). The top row shows a video recorded in VC while the bottom row shows the transformed video as if it were F2F. Right: FacET generates a report for each corresponding cluster showing how each $\beta$-VAE latent varies across the domains. Refer to Fig. \ref{fig:results-faces} to interpret each latent index. See Sec. \ref{ssec:quali} for detailed explanations. Please refer to the supplementary for full videos.
  • ...and 4 more figures
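
The per-chunk shift and scale translation mentioned in the TL;DR and in the Figure 4 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function name `apply_translator`, the array shapes, and the reading of the translator as per-chunk scales, per-chunk shifts, and temporal change points are assumptions made for illustration, since the captions above do not define $\omega$, $\phi$, and $\tau$ individually. In the paper the translator is produced by the learned function $G_{XY}$; here it is simply passed in by hand.

```python
import numpy as np


def apply_translator(z, scales, shifts, change_points):
    """Apply a per-chunk affine map to a latent sequence (illustrative only).

    z             : (T, D) array, one D-dimensional latent per frame.
    scales        : (K, D) array, hypothetical per-chunk scale factors.
    shifts        : (K, D) array, hypothetical per-chunk offsets.
    change_points : K-1 increasing frame indices in (0, T) that split
                    the sequence into K contiguous chunks.
    """
    z = np.asarray(z, dtype=float)
    scales = np.asarray(scales, dtype=float)
    shifts = np.asarray(shifts, dtype=float)
    num_frames = z.shape[0]

    # Chunk k covers frames bounds[k] .. bounds[k + 1] - 1.
    bounds = [0, *change_points, num_frames]
    if any(lo >= hi for lo, hi in zip(bounds[:-1], bounds[1:])):
        raise ValueError("change points must be strictly increasing in (0, T)")
    if not (len(scales) == len(shifts) == len(bounds) - 1):
        raise ValueError("need exactly one scale and one shift per chunk")

    z_out = np.empty_like(z)
    for k, (lo, hi) in enumerate(zip(bounds[:-1], bounds[1:])):
        # Same scale and shift for every frame inside a chunk.
        z_out[lo:hi] = scales[k] * z[lo:hi] + shifts[k]
    return z_out


# Toy example: 6 frames, 2 latent dimensions, one change point at frame 4.
z_demo = np.ones((6, 2))
scales_demo = np.array([[1.0, 1.0], [2.0, 1.0]])
shifts_demo = np.array([[0.0, 0.0], [0.0, -1.0]])
z_translated = apply_translator(z_demo, scales_demo, shifts_demo, [4])
```

In this toy example the first chunk (frames 0 to 3) is left unchanged, while the second chunk (frames 4 and 5) has its first latent doubled and its second latent lowered by one. Because every chunk is transformed by an explicit scale and shift per latent dimension, the difference between `z_translated` and `z_demo` can be read off directly, which is the property that makes a per-latent report of domain differences possible.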