ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search

Zhenjie Liu, Jianzhang Lu, Renjie Lu, Cong Liang, Shangfei Wang

TL;DR

ConsistTalk tackles flickering, identity drift, and audio-visual misalignment in diffusion-based talking head generation by decoupling motion from appearance via an Optical Flow Guided Temporal Module and modeling motion intensity with a multimodal Audio-to-Intensity framework. It introduces IC-Init, a training-free diffusion-inference strategy guided by intensity to improve identity preservation and motion continuity. Extensive experiments on HDTF and other datasets show significant improvements in temporal stability, lip-sync accuracy, and motion diversity over prior state-of-the-art methods. The work provides a practical path toward controllable, high-fidelity, long-form talking head video generation.

Abstract

Recent advancements in video diffusion models have significantly enhanced audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization. These issues primarily stem from entangled appearance-motion representations and unstable inference strategies. In this paper, we introduce ConsistTalk, a novel intensity-controllable and temporally consistent talking head generation framework with diffusion noise search inference. First, we propose an optical flow-guided temporal module (OFT) that decouples motion features from static appearance by leveraging facial optical flow, thereby reducing visual flicker and improving temporal consistency. Second, we present an Audio-to-Intensity (A2I) model obtained through multimodal teacher-student knowledge distillation. By transforming audio and facial velocity features into a frame-wise intensity sequence, the A2I model enables joint modeling of audio and visual motion, resulting in more natural dynamics. This further enables fine-grained, frame-wise control of motion dynamics while maintaining tight audio-visual synchronization. Third, we introduce a diffusion noise initialization strategy (IC-Init). By enforcing explicit constraints on background coherence and motion continuity during inference-time noise search, we achieve better identity preservation and refine motion dynamics compared to the current autoregressive strategy. Extensive experiments demonstrate that ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and delivering temporally stable, high-fidelity talking head videos.
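
To make the notion of a frame-wise intensity sequence concrete, the sketch below computes a simple hand-crafted proxy from facial landmark velocities. It is not the A2I model, which the abstract describes as learned through teacher-student distillation from audio and facial velocity features. The function name `motion_intensity`, the `(T, K, 2)` landmark layout, and the max-normalization are all assumptions made for illustration.

```python
import numpy as np

def motion_intensity(landmarks, eps=1e-8):
    """Hypothetical proxy: per-frame motion intensity from landmark velocity.

    landmarks: array of shape (T, K, 2), T frames of K 2-D facial landmarks.
    Returns an array of shape (T,) scaled to [0, 1].
    """
    landmarks = np.asarray(landmarks, dtype=np.float64)
    if landmarks.ndim != 3 or landmarks.shape[0] < 2:
        raise ValueError("expected shape (T, K, 2) with T >= 2")
    # Frame-to-frame displacement of every landmark: shape (T-1, K, 2).
    velocity = np.diff(landmarks, axis=0)
    # Mean landmark speed per frame transition: shape (T-1,).
    speed = np.linalg.norm(velocity, axis=-1).mean(axis=1)
    # Repeat the first value so the sequence has one entry per frame.
    speed = np.concatenate([speed[:1], speed])
    # Normalize by the peak speed (eps guards against a fully static clip).
    return speed / (speed.max() + eps)
```

A sequence like this could serve as a per-frame control signal: scaling it up or down would correspond to asking for more or less motion in that frame, which is the kind of manipulation the abstract refers to as intensity control.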

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Failure cases in maintaining visual and temporal consistency of current methods. The left samples exhibit significant visual distortion (a) and background deformation with identity drift (b). The rightmost visualization displays temporal flickering (c) in the optical flows.
  • Figure 2: Overall architecture of the proposed ConsistTalk framework. The system integrates three key modules: (a) a facial optical flow-guided temporal module (OFT) that decouples motion dynamics from appearance features to suppress flicker and enhance temporal consistency; (b) an audio-to-intensity (A2I) model that transforms audio and facial velocity features into frame-wise intensity signals for fine-grained motion control; and (c) a noise initialization strategy (IC-Init) that stabilizes the generation process by leveraging a noise search procedure.
  • Figure 3: Audio-to-Intensity module and IC-Init. The left side presents the detailed structure of the audio-to-intensity module, and on the right is the inference-time noise search strategy IC-Init based on intensity-guided frequency decomposition.
  • Figure 4: Qualitative comparisons with state-of-the-art talking head generation methods on the HDTF dataset.
  • Figure 5: Ablation studies. (a): Visual results for the ablation of the facial optical flow-guided temporal module (OFT). (b): Line chart of the acquired intensity sequence. (c): Qualitative comparison under intensity score manipulation. (d): Qualitative comparison between FreeInit and the proposed IC-Init. (e): Comparison of long video generation quality between ConsistTalk and the baseline Hallo.
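
The frequency decomposition mentioned for IC-Init in Figure 3 can be illustrated with a generic frequency-domain noise mix: low frequencies are taken from a reference noise tensor and high frequencies from freshly sampled noise. This is only a minimal sketch of that general idea, not the paper's procedure. The names `lowpass_mask` and `mix_noise`, the ideal (hard) low-pass filter, and the `(T, H, W)` layout are assumptions, and since this section does not say how the intensity signal determines the split, `cutoff` is left as a plain parameter.

```python
import numpy as np

def lowpass_mask(shape, cutoff):
    """Ideal low-pass mask in unshifted FFT layout.

    cutoff is a fraction of the Nyquist frequency; components whose
    normalized radial frequency is <= cutoff are kept (mask value 1).
    """
    # fftfreq returns cycles/sample in [-0.5, 0.5); rescale to [-1, 1).
    axes = [np.fft.fftfreq(n) / 0.5 for n in shape]
    grid = np.meshgrid(*axes, indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grid))
    return (radius <= cutoff).astype(np.float64)

def mix_noise(reference, fresh, cutoff):
    """Combine low frequencies of `reference` with high frequencies of `fresh`."""
    if reference.shape != fresh.shape:
        raise ValueError("reference and fresh must have the same shape")
    mask = lowpass_mask(reference.shape, cutoff)
    ref_f = np.fft.fftn(reference)
    fresh_f = np.fft.fftn(fresh)
    mixed = ref_f * mask + fresh_f * (1.0 - mask)
    # Inputs are real and the mask is symmetric, so the imaginary part
    # is numerical noise only.
    return np.real(np.fft.ifftn(mixed))

rng = np.random.default_rng(0)
reference = rng.standard_normal((8, 16, 16))  # (T, H, W), hypothetical latent shape
fresh = rng.standard_normal((8, 16, 16))
init_noise = mix_noise(reference, fresh, cutoff=0.25)
```

A smaller `cutoff` keeps only the coarsest structure of the reference, while a larger one carries over more of it. An intensity-guided variant would presumably choose that trade-off from the intensity sequence rather than fixing it by hand.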