Exploring Timeline Control for Facial Motion Generation

Yifeng Ma, Jinwei Qi, Chaonan Ji, Peng Zhang, Bang Zhang, Zhidong Deng, Liefeng Bo

TL;DR

This work addresses the coarse timing of existing facial motion controls with three contributions: timeline control as a new control signal; a labor-efficient method for annotating frame-level facial actions using Toeplitz Inverse Covariance-based Clustering (TICC); and a diffusion-based generation model with a base-branch architecture that produces motions aligned to input timelines. It also enables text-guided generation by using ChatGPT to convert natural language descriptions into timelines. On RealTalk data, the approach produces accurate, timeline-consistent facial motions, with strong per-region annotation Macro-F1 scores and favorable qualitative results. Together, fine-grained timeline annotations, region-specific diffusion generation, and text-to-timeline translation give users precise, language-driven control over facial motion timing, with applications in photorealistic digital humans and film.

Abstract

This paper introduces a new control signal for facial motion generation: timeline control. Compared to audio and text signals, timelines provide more fine-grained control, such as generating specific facial motions with precise timing. Users can specify a multi-track timeline of facial actions arranged in temporal intervals, allowing precise control over the timing of each action. To model the timeline control capability, we first annotate the time intervals of facial actions in natural facial motion sequences at a frame-level granularity. This process is facilitated by Toeplitz Inverse Covariance-based Clustering to minimize human labor. Based on the annotations, we propose a diffusion-based generation model capable of generating facial motions that are natural and accurately aligned with input timelines. Our method supports text-guided motion generation by using ChatGPT to convert text into timelines. Experimental results show that our method can annotate facial action intervals with satisfactory accuracy and produce natural facial motions accurately aligned with timelines.
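
To make the notion of a multi-track timeline concrete, the following is a minimal sketch of one possible representation, not the authors' data format. The `Interval` class, the `rasterize` helper, the track names (`brow`, `eye`), the action labels, and the `neutral` default are all hypothetical names chosen for illustration. The sketch expands each track's intervals into one label per frame, which is the frame-level granularity the abstract describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One facial action on one track (hypothetical representation)."""
    action: str  # illustrative label, e.g. "raise" or "blink"
    start: int   # first frame of the action, inclusive
    end: int     # frame after the last frame of the action, exclusive


def rasterize(timeline, num_frames, idle="neutral"):
    """Expand a multi-track timeline into per-frame labels for each track.

    `timeline` maps a track name to a list of Interval objects.
    Frames not covered by any interval receive the `idle` label.
    """
    frames = {}
    for track, intervals in timeline.items():
        labels = [idle] * num_frames
        for iv in intervals:
            if not 0 <= iv.start < iv.end <= num_frames:
                raise ValueError(f"{track}: interval {iv} is out of range")
            for t in range(iv.start, iv.end):
                if labels[t] != idle:
                    raise ValueError(f"{track}: intervals overlap at frame {t}")
                labels[t] = iv.action
        frames[track] = labels
    return frames


# Two tracks over an 8-frame clip (assumed example values).
timeline = {
    "brow": [Interval("raise", 2, 5)],
    "eye": [Interval("blink", 0, 1), Interval("blink", 6, 7)],
}
frame_labels = rasterize(timeline, num_frames=8)
```

Keeping one track per facial region lets actions in different regions overlap in time, while the overlap check keeps each individual track unambiguous.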

Paper Structure

This paper contains 10 sections, 1 equation, 11 figures, 2 tables.

Figures (11)

  • Figure 1: We introduce a new control signal for facial motion generation: timeline control. We first utilize a labor-efficient approach to annotate the time intervals of facial motion at a frame-level granularity. Using the annotations, we propose a model that can generate natural facial motions aligned with an input timeline. Compared to previous controls like audio and text, timeline control enables precise temporal control of facial motions. In this paper, facial motions are rendered into photorealistic videos for better visualization.
  • Figure 2: The pipeline of frame-level facial motion annotation (using brow motions as an example). We first extract facial motion descriptors (blendshapes) from natural facial motion videos and concatenate the results to create a facial motion time series for time series analysis. This analysis can simultaneously segment the sequence into a series of motion patterns and cluster similar patterns, resulting in multiple clusters, each containing consistent facial motion patterns. Then, by inspecting a few patterns, we identify the facial motions each cluster represents, thereby obtaining frame-level facial motion annotations for all videos.
  • Figure 3: Illustration of the generation model. (a) Base-Branch Design. The base network takes the timelines of all facial regions as input and outputs base features that model the global facial motion couplings. Through timeline selection, each region's timeline is directed to its respective branch network. Since head pose is interconnected with all facial movements, the pose branch receives the timelines of all regions. Each branch network takes the timeline of its corresponding region to generate the facial motions for that region. These motions are then combined to produce the overall motion of the entire face. Lin. Proj. denotes Linear Projection. (b) Base/Branch Network's Architecture. Timeline control guides motion generation through cross-attention. The initial timeline tokens remain unchanged and are added at each layer. The diffusion step (omitted in sub-figure (a) for clarity) is applied to each base and branch network.
  • Figure 4: An example of brow motion annotation.
  • Figure 5: An example of eye motion annotation.
  • ...and 6 more figures
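
The last step of the annotation pipeline in Figure 2, turning per-frame cluster assignments into labeled time intervals, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's implementation: it does not perform the time series clustering itself, and it assumes the per-frame cluster ids are already available. The function `labels_to_intervals`, the cluster ids, and the cluster-to-action names are hypothetical; the name mapping stands in for the manual inspection of a few patterns per cluster.

```python
from itertools import groupby


def labels_to_intervals(cluster_ids, cluster_names):
    """Merge consecutive frames with the same cluster id into intervals.

    Returns a list of (action, start, end) tuples, with `start` inclusive
    and `end` exclusive, in frame units.
    """
    intervals = []
    start = 0
    for cluster_id, run in groupby(cluster_ids):
        length = sum(1 for _ in run)  # number of consecutive frames in this run
        intervals.append((cluster_names[cluster_id], start, start + length))
        start += length
    return intervals


# Assumed output of a clustering step for an 8-frame brow sequence.
cluster_ids = [0, 0, 1, 1, 1, 0, 2, 2]
# Assumed names assigned after inspecting a few patterns from each cluster.
cluster_names = {0: "neutral", 1: "raise", 2: "frown"}

intervals = labels_to_intervals(cluster_ids, cluster_names)
```

Because a human only names each cluster once, every frame assigned to that cluster inherits the label, which is where the labor saving in this kind of pipeline comes from.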