Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance
Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang
TL;DR
The paper addresses temporal instability in talking-head portrait editing by introducing Follow Your Motion (FYM), a two-stage pipeline that first renders portraits with 3D Gaussian Splatting and then learns motion trajectories across frames with a trajectory-conditioned diffusion model. A multi-resolution 3D hash encoding captures pixel-space motion and guides the diffusion model to preserve content consistency, while a dynamic re-weighted attention mechanism driven by a landmark loss keeps fine-grained facial expressions temporally consistent. The framework also includes a ControlNet-based adaptation step for fine-tuning on edited portraits, which lets it integrate with existing editing tools. In experiments on a monocular portrait dataset, FYM achieves better temporal coherence and prompt alignment than state-of-the-art baselines, and ablations confirm the contributions of the hash-based motion learning and the landmark-aware weighting. The result is a practical, edit-friendly approach for text-driven editing, relighting, and other portrait-centric video tasks.
Abstract
Pre-trained conditional diffusion models have demonstrated remarkable potential in image editing. However, they often struggle with temporal consistency, particularly in the talking head domain, where continuously changing facial expressions make the problem harder. These issues stem from editing each image independently, which loses temporal continuity during the editing process. In this paper, we introduce Follow Your Motion (FYM), a generic framework for maintaining temporal consistency in portrait editing. Specifically, given portrait images rendered by a pre-trained 3D Gaussian Splatting model, we first develop a diffusion model that learns motion trajectory changes at different scales and pixel coordinates, from the first frame to each subsequent frame. This approach ensures that temporally inconsistent edited avatars inherit the motion information from the rendered avatars. Second, to maintain fine-grained temporal consistency of expressions in talking head editing, we propose a dynamic re-weighted attention mechanism. This mechanism assigns higher weight coefficients to landmark points in space and dynamically updates these weights based on a landmark loss, achieving more consistent and refined facial expressions. Extensive experiments demonstrate that our method outperforms existing approaches in terms of temporal consistency and can be used to correct temporally inconsistent outputs in a range of applications, such as text-driven editing and relighting.
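The landmark re-weighting idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the update rule, so the functions `landmark_losses`, `update_weights`, and `weighted_loss`, the `base` and `rate` parameters, and the proportional update itself are all assumptions made for illustration. The sketch only shows the general pattern of giving landmarks with larger error more weight on the next step.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def landmark_losses(pred: Sequence[Point], ref: Sequence[Point]) -> List[float]:
    """Squared distance between each predicted landmark and its reference."""
    return [(px - rx) ** 2 + (py - ry) ** 2 for (px, py), (rx, ry) in zip(pred, ref)]


def update_weights(
    weights: Sequence[float],
    losses: Sequence[float],
    base: float = 1.0,
    rate: float = 0.5,
) -> List[float]:
    """Hypothetical update rule (an assumption, not taken from the paper).

    Each weight moves toward a target equal to `base` plus the landmark's
    share of the total loss, scaled by the number of landmarks. Landmarks
    with above-average error therefore gain weight; if the total loss is
    zero, every weight relaxes toward `base`.
    """
    total = sum(losses)
    n = len(losses)
    updated = []
    for w, loss in zip(weights, losses):
        target = base + (n * loss / total if total > 0.0 else 0.0)
        updated.append((1.0 - rate) * w + rate * target)
    return updated


def weighted_loss(weights: Sequence[float], losses: Sequence[float]) -> float:
    """Weight-normalized landmark loss."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Toy example: three landmarks, only the last one drifts from the reference.
reference = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]
predicted = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

losses = landmark_losses(predicted, reference)
weights = update_weights([1.0, 1.0, 1.0], losses)
```

In this toy case only the drifting landmark gains weight, so the weighted loss puts more emphasis on it than a plain average would. A real system would apply such weights inside the attention or loss computation of the diffusion model, which this sketch does not attempt.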
