TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles

Yifeng Ma, Suzhen Wang, Yu Ding, Bowen Ma, Tangjie Lv, Changjie Fan, Zhipeng Hu, Zhidong Deng, Xin Yu

TL;DR

TalkCLIP addresses the need for expressive talking-head generation guided by natural language rather than reference videos. It introduces TA-MEAD, a text-video paired dataset with rich text annotations, and a CLIP-based text-to-speaking-style (T2SS) encoder with an adapter. The T2SS encoder is trained under supervision from a video-to-speaking-style (V2SS) teacher so that text descriptions map to speaking styles. The framework combines audio-driven synthesis with text-guided style control, enabling style editing and intensity modulation while preserving identity and lip-sync. Experiments on MEAD, HDTF, and VoxCeleb2 show competitive performance and strong generalization to out-of-domain prompts, which supports the practicality of text-based expressive control.

Abstract

Audio-driven talking head generation has drawn growing attention. To produce talking head videos with desired facial expressions, previous methods rely on extra reference videos to provide expression information, which may be difficult to find and hence limits their usage. In this work, we propose TalkCLIP, a framework that can generate talking heads where the expressions are specified by natural language, hence allowing for specifying expressions more conveniently. To model the mapping from text to expressions, we first construct a text-video paired talking head dataset where each video has diverse text descriptions that depict both coarse-grained emotions and fine-grained facial movements. Leveraging the proposed dataset, we introduce a CLIP-based style encoder that projects natural language-based descriptions to the representations of expressions. TalkCLIP can even infer expressions for descriptions unseen during training. TalkCLIP can also use text to modulate expression intensity and edit expressions. Extensive experiments demonstrate that TalkCLIP achieves the advanced capability of generating photo-realistic talking heads with vivid facial expressions guided by text descriptions.
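
The idea of training a text-based style encoder under supervision from a video-based one can be illustrated with a toy sketch. Everything below is an assumption made for illustration: random arrays stand in for frozen text features and for the teacher's style codes, the adapter is a single linear layer, and the loss is a plain mean squared error. None of these choices is claimed to match the paper's actual architecture or objective.

```python
import numpy as np

rng = np.random.default_rng(0)
TEXT_DIM, STYLE_DIM, N = 8, 4, 16  # toy sizes, not the paper's

# Stand-ins (assumptions): random features instead of real encoder outputs.
text_feat = rng.normal(size=(N, TEXT_DIM))       # frozen text-encoder features
teacher_style = rng.normal(size=(N, STYLE_DIM))  # style codes from a video-based teacher

def distill_loss(W, b):
    """Mean squared error between adapter outputs and teacher style codes."""
    pred = text_feat @ W + b
    return float(np.mean((pred - teacher_style) ** 2))

def train_adapter(W, b, steps=200, lr=0.05):
    """Plain gradient descent on the linear adapter; the text features stay fixed."""
    for _ in range(steps):
        err = text_feat @ W + b - teacher_style
        grad_W = 2.0 * text_feat.T @ err / err.size
        grad_b = 2.0 * err.sum(axis=0) / err.size
        W = W - lr * grad_W
        b = b - lr * grad_b
    return W, b

W0 = np.zeros((TEXT_DIM, STYLE_DIM))
b0 = np.zeros(STYLE_DIM)
loss_before = distill_loss(W0, b0)
W1, b1 = train_adapter(W0, b0)
loss_after = distill_loss(W1, b1)
print(f"distillation loss: {loss_before:.3f} -> {loss_after:.3f}")
```

Only the adapter parameters are updated here, which mirrors the general pattern of keeping a pretrained text encoder fixed and learning a small mapping into the style space.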

Paper Structure

This paper contains 15 sections, 4 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Results of TalkCLIP. Given natural language text that describes the desired speaking style, TalkCLIP can produce audio-driven talking head videos with the specified speaking style. The text can be unseen during training (red text indicates words unseen during training).
  • Figure 2: (a) The two types of video annotation. We recruit annotators to annotate the emotion of emotion-consistent video groups and obtain the emotion annotation table. We utilize an off-the-shelf AU detector to detect the AU intensity of each video and obtain the video AU annotation table. (b) The pipeline of automatically constructing the description sentence for the video.
  • Figure 3: TalkCLIP pipeline. The text-to-speaking-style (T2SS) encoder predicts a speaking style from natural language text. By integrating the T2SS encoder with other modules, TalkCLIP can generate talking heads with text-guided speaking styles. To increase the alignment between text and predicted styles, we introduce a video-to-speaking-style (V2SS) encoder, which predicts speaking styles from video, to guide the training of the T2SS encoder.
  • Figure 4: Qualitative comparisons. Note that the speaking style of our method is derived from a text description. The style reference text is "A woman feels ecstatic and speaks with fairly lifted cheek, strongly raised outer brow, and lip corner fully pulled."
  • Figure 5: Qualitative results of the ablation study. The speaking style is inferred from the out-of-domain text "A woman screwed up her face".
  • ...and 4 more figures
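
The caption of Figure 2(b) mentions automatically constructing a description sentence from an emotion annotation and per-video AU intensities. A minimal sketch of such a template-based step follows. The AU-to-phrase table, the intensity thresholds, the adverbs, and the sentence template are all hypothetical, chosen only to resemble the style of the reference text quoted for Figure 4. They are not the paper's actual rules.

```python
from typing import Dict

# Hypothetical AU-to-phrase table (illustrative only).
AU_PHRASES = {
    "AU02": "raised outer brow",
    "AU06": "lifted cheek",
    "AU12": "pulled lip corner",
}

def intensity_adverb(value: float) -> str:
    """Map an AU intensity in [0, 1] to a coarse adverb (assumed bins)."""
    if value >= 0.66:
        return "strongly"
    if value >= 0.33:
        return "fairly"
    return "slightly"

def join_phrases(parts: list) -> str:
    """Join phrases as 'a', 'a and b', or 'a, b, and c'."""
    if len(parts) <= 2:
        return " and ".join(parts)
    return ", ".join(parts[:-1]) + ", and " + parts[-1]

def build_description(subject: str, emotion: str,
                      au_intensity: Dict[str, float], top_k: int = 3) -> str:
    """Combine a coarse emotion label with the top_k most active known AUs."""
    known = [(au, v) for au, v in au_intensity.items() if au in AU_PHRASES]
    active = sorted(known, key=lambda item: item[1], reverse=True)[:top_k]
    parts = [f"{intensity_adverb(v)} {AU_PHRASES[au]}" for au, v in active]
    if not parts:
        return f"{subject} feels {emotion}."
    return f"{subject} feels {emotion} and speaks with {join_phrases(parts)}."

sentence = build_description(
    "A woman", "ecstatic", {"AU06": 0.5, "AU02": 0.9, "AU12": 0.7}
)
print(sentence)
```

Keeping the coarse emotion and the fine-grained movements in separate clauses makes it easy to vary either one independently when generating several descriptions for the same video.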