Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance

Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang

TL;DR

The paper addresses temporal instability in talking-head portrait editing by introducing Follow Your Motion (FYM), a two-stage pipeline that first renders portraits with 3D Gaussian Splatting and then learns motion trajectories across frames via a trajectory-conditioned diffusion model. A multi-resolution 3D hash encoding captures pixel-space motion and guides the diffusion model to preserve content consistency, while a dynamic re-weighted attention mechanism driven by a landmark loss improves the temporal consistency of fine-grained facial expressions. The framework also includes a ControlNet-based adaptation step that fine-tunes on edited portraits, so it can be combined with existing editing tools. In experiments on a monocular portrait dataset, FYM achieves better temporal coherence and prompt alignment than state-of-the-art baselines, and ablations confirm the contributions of the hash-based motion learning and the landmark-aware weighting. The work offers a practical, edit-friendly approach for text-driven editing, relighting, and other portrait-centric video tasks.

Abstract

Pre-trained conditional diffusion models have demonstrated remarkable potential in image editing. However, they often struggle with temporal consistency, particularly in the talking head domain, where continuous changes in facial expression make the problem harder. These issues stem from the independent editing of individual images and the inherent loss of temporal continuity during the editing process. In this paper, we introduce Follow Your Motion (FYM), a generic framework for maintaining temporal consistency in portrait editing. Specifically, given portrait images rendered by a pre-trained 3D Gaussian Splatting model, we first develop a diffusion model that intuitively and inherently learns motion trajectory changes at different scales and pixel coordinates, from the first frame to each subsequent frame. This approach ensures that temporally inconsistent edited avatars inherit the motion information from the rendered avatars. Second, to maintain fine-grained temporal consistency of expressions in talking head editing, we propose a dynamic re-weighted attention mechanism. This mechanism assigns higher weight coefficients to landmark points in space and dynamically updates these weights based on landmark loss, achieving more consistent and refined facial expressions. Extensive experiments demonstrate that our method outperforms existing approaches in terms of temporal consistency and can be used to optimize and compensate for temporally inconsistent outputs in a range of applications, such as text-driven editing and relighting.
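
The dynamic re-weighting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the update rule, the function names (`landmark_loss`, `update_weights`, `reweight_attention`), the parameters `lr`, `w_min`, and `w_max`, and the way landmarks map to attention columns are all assumptions made for illustration. The sketch only shows the general pattern of giving landmark positions larger weights and raising those weights where the landmark loss is high.

```python
import numpy as np

def landmark_loss(pred, target):
    """Per-landmark squared L2 distance; pred and target have shape (K, 2)."""
    return np.sum((pred - target) ** 2, axis=-1)

def update_weights(weights, losses, lr=0.5, w_min=1.0, w_max=5.0):
    """Move each weight toward a loss-proportional target (assumed rule)."""
    target = w_min + (w_max - w_min) * losses / (losses.max() + 1e-8)
    return np.clip((1.0 - lr) * weights + lr * target, w_min, w_max)

def reweight_attention(attn, landmark_idx, weights):
    """Scale attention columns at landmark positions, then renormalise rows."""
    scale = np.ones(attn.shape[-1])
    scale[landmark_idx] = weights
    out = attn * scale
    return out / out.sum(axis=-1, keepdims=True)

# Toy example: 4 queries, 6 keys, and two keys (1 and 4) that sit on landmarks.
attn = np.full((4, 6), 1.0 / 6.0)
landmark_idx = np.array([1, 4])
pred = np.array([[0.0, 0.0], [0.0, 0.0]])
target = np.array([[0.0, 0.0], [3.0, 4.0]])  # second landmark is badly off

weights = update_weights(np.ones(2), landmark_loss(pred, target))
new_attn = reweight_attention(attn, landmark_idx, weights)
```

In this toy run the second landmark has the larger loss, so its weight grows and the attention mass on key 4 rises relative to key 1, while every row still sums to one.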

Paper Structure

This paper contains 14 sections, 11 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Compared to other state-of-the-art 2D portrait generation methods, our approach directly controls the motion of the edited result by using multi-scale motion trajectories of pixel points as guidance, ensuring temporal consistency. The other methods in the figure can be seen as special cases of our approach. Additionally, our method supports various editing tools and optimizes the temporal consistency of the results produced by these tools.
  • Figure 2: Overview. FYM begins with an efficient 3DGS model that renders temporally consistent portraits (Sec. \ref{sec:3dgs}). Next, we develop a diffusion model that intuitively and inherently learns the motion trajectory changes at different scales and pixel coordinates from the original video frames (Sec. \ref{sec:OCTCM}). Finally, we propose a dynamic re-weighted attention mechanism that assigns higher weight coefficients to landmark coordinates in space and dynamically updates these weights based on landmark loss, achieving more consistent and refined facial expressions (Sec. \ref{sec:dram}).
  • Figure 3: The schematic diagram of encoding spatial points using multi-resolution 3D hash encoding.
  • Figure 4: Quantitative results. Compared to state-of-the-art methods, our approach generates high-quality, temporally consistent editing results that align with the prompt. For a clearer comparison of temporal consistency across methods, enlarge the image or refer to the demo in the supplementary materials.
  • Figure 5: Ablation experiment results of Multi-resolution Hash Encoding (MHE).
  • ...and 2 more figures
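
The multi-resolution 3D hash encoding named in Figures 3 and 5 can be sketched in the style of Instant-NGP hash grids. This is a hedged illustration under stated assumptions, not the paper's code: treating the three axes as normalised (x, y, t), the names `hash_coords` and `encode`, the table sizes, and the `base_res` and `growth` parameters are all hypothetical. Each level hashes the eight corners of the enclosing grid cell into a feature table and blends them trilinearly; the per-level features are then concatenated.

```python
import numpy as np

# Primes of the Instant-NGP spatial hash (the first one is 1 by design).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_coords(ijk, table_size):
    """Hash non-negative integer grid coordinates (..., 3) into [0, table_size)."""
    ijk = ijk.astype(np.uint64)
    h = ijk[..., 0] * PRIMES[0]
    h ^= ijk[..., 1] * PRIMES[1]
    h ^= ijk[..., 2] * PRIMES[2]
    return (h % np.uint64(table_size)).astype(np.int64)

def encode(points, tables, base_res=4, growth=2.0):
    """Encode points in [0, 1]^3 (assumed x, y, t) into concatenated features."""
    points = np.asarray(points, dtype=np.float64)
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)  # grid resolution at this level
        scaled = points * res
        lo = np.floor(scaled).astype(np.int64)  # lower corner of the cell
        frac = scaled - lo                      # position inside the cell
        out = np.zeros((points.shape[0], table.shape[1]))
        for corner in range(8):
            offset = np.array([(corner >> k) & 1 for k in range(3)])
            idx = hash_coords(lo + offset, table.shape[0])
            # Trilinear weight of this corner; the 8 weights sum to 1.
            w = np.prod(np.where(offset == 1, frac, 1.0 - frac), axis=-1)
            out += w[:, None] * table[idx]
        feats.append(out)
    return np.concatenate(feats, axis=-1)

# Toy usage: 4 levels, 2 features per level, 5 random (x, y, t) points.
rng = np.random.default_rng(0)
tables = [rng.normal(scale=1e-2, size=(2 ** 10, 2)) for _ in range(4)]
pts = rng.random((5, 3))
features = encode(pts, tables)
print(features.shape)  # (5, 8)
```

In a learned setting the tables would be trainable parameters; here they are random arrays only so that the lookup and interpolation can be run end to end.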