Jump Cut Smoothing for Talking Heads

Xiaojuan Wang, Taesung Park, Yang Zhou, Eli Shechtman, Richard Zhang

TL;DR

Jump Cut Smoothing for Talking Heads tackles abrupt jump cuts in talking head videos by synthesizing seamless intermediate frames guided by interpolated DensePose keypoints. It introduces cross-modal attention to fuse appearances from multiple source frames with a target pose, plus recursive and blended synthesis strategies to bridge end frames under complex motion. The approach outperforms frame-interpolation baselines like FILM in realism and identity preservation, especially during large head rotations, and demonstrates practical applicability to filler-word removal with preserved audio flow. This work advances video editing by enabling realistic, motion-consistent jump-cut smoothing using a DensePose-guided, multi-source synthesis framework with scalable attention across sources.

Abstract

A jump cut offers an abrupt, sometimes unwanted change in the viewing experience. We present a novel framework for smoothing these jump cuts in the context of talking head videos. We leverage the appearance of the subject from the other source frames in the video, fusing it with a mid-level representation driven by DensePose keypoints and face landmarks. To achieve motion, we interpolate the keypoints and landmarks between the end frames around the cut. We then use an image translation network to synthesize pixels from the keypoints and source frames. Because keypoints can contain errors, we propose a cross-modal attention scheme to select the most appropriate source amongst multiple options for each keypoint. By leveraging this mid-level representation, our method can achieve stronger results than a strong video interpolation baseline. We demonstrate our method on various jump cuts in talking head videos, such as cutting filler words, pauses, and even random cuts. Our experiments show that we can achieve seamless transitions, even in the challenging cases where the talking head rotates or moves drastically in the jump cut.
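
As a rough illustration of the keypoint interpolation step mentioned in the abstract, the sketch below blends two sets of 2D keypoints linearly. The abstract only says that keypoints and landmarks are interpolated between the end frames; the linear scheme, the function name `interpolate_keypoints`, and the `(x, y)` tuple layout are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def interpolate_keypoints(
    start: List[Point], end: List[Point], num_frames: int
) -> List[List[Point]]:
    """Return `num_frames` keypoint sets strictly between `start` and `end`.

    Assumption: plain linear interpolation per keypoint. The two inputs
    must list the same keypoints in the same order.
    """
    if len(start) != len(end):
        raise ValueError("start and end must contain the same number of keypoints")
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")

    frames = []
    for i in range(1, num_frames + 1):
        # t runs over evenly spaced values in (0, 1), excluding both end frames.
        t = i / (num_frames + 1)
        frames.append(
            [
                ((1 - t) * xs + t * xe, (1 - t) * ys + t * ye)
                for (xs, ys), (xe, ye) in zip(start, end)
            ]
        )
    return frames
```

Each returned keypoint set would then serve as the target pose for one synthesized transition frame.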

Paper Structure

This paper contains 17 sections, 1 equation, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Jump cut smoothing for filler-word removal. Given a talking head video, we remove the filler words and repetitive words (text in red), and create a seamless transition for the jump cut as shown in the second row.
  • Figure 2: Method overview. In the training stage, we randomly sample source (green rectangle) and target (red rectangle) frames and extract their DensePose keypoints, augmented with facial landmarks (not shown here for simplicity). Our method uses source dense keypoint features as keys, target dense keypoint features as queries, and source image features as values; cross-attention then retrieves the values for each query, yielding the warped feature. This warped feature is fed into a generator inspired by Co-Mod GAN to synthesize a realistic target image, which is compared against the ground-truth target frame. To apply jump cut smoothing at inference, we interpolate dense keypoints between the jump cut end frames and synthesize the transition frames from the interpolated keypoint sequence (yellow rectangle).
  • Figure 3: Image animation methods cannot be applied to jump cut smoothing. Row 1: Single-image animation works (FaceVid2Vid [wang2021one], Face2Face$^\rho$ [yang2022face2face]) animate one of the cut end frames according to the keypoint sequence, neglecting the other end frame. Row 2: Other works (FOMM [siarohin2019first], ImplicitWarping [mallya2022implicit]) require a driving video for motion extraction, which is absent in our scenario. Row 3: Our approach uses at least two cut end frames to generate the transition (shown in orange).
  • Figure 4: Visualization of learned correspondence with our attention mechanism. The top left is our synthesized image given the other three images as sources. We highlight the locations in the synthesized image where the peak attention score is $\ge 0.75$, and show their learned corresponding locations (marked with the same color) in the source images. Our attention picks the appropriate feature from different sources per location, e.g., for the blue point in the lower eyelid, our attention learned to associate with the eyelid feature in the bottom right source image.
  • Figure 5: Recursive synthesis. To fill in a jump cut with smooth, intermediate frames, we recursively fill in frames from the end towards the middle.
  • ...and 8 more figures
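
The key/query/value roles described in the Figure 2 caption can be sketched as a small dot-product attention routine. This is a minimal illustration, not the authors' implementation: the `1/sqrt(d)` scaling, the softmax normalization, the flat list-of-vectors layout, and the names `softmax` and `cross_attention` are all assumptions. To use several source frames, one plausible arrangement (also an assumption) is to concatenate the key and value lists of all sources so that each query location can draw from any of them.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Warp source features to target locations.

    queries: target keypoint features, one vector per target location
    keys:    source keypoint features, one vector per source location
    values:  source image features, aligned one-to-one with `keys`
    Returns one attention-weighted value vector per query.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    dim = len(keys[0])
    channels = len(values[0])

    warped = []
    for q in queries:
        # Similarity between this target location and every source location.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of source image features.
        warped.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(channels)]
        )
    return warped
```

In this sketch, a query that closely matches one key receives almost all of its output from that key's value, which mirrors the per-location source selection shown in Figure 4.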