InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

Yuchi Wang; Junliang Guo; Jianhong Bai; Runyi Yu; Tianyu He; Xu Tan; Xu Sun; Jiang Bian

TL;DR

InstructAvatar tackles the challenge of producing emotionally expressive 2D talking avatars with fine-grained, text-guided control. It combines VAE-based motion-appearance disentanglement with a two-branch diffusion-based generator conditioned on audio and natural language instructions, and introduces an automatic instruction-data pipeline that extracts action units (AUs) and paraphrases them with GPT-4V. The authors report improved emotion controllability, lip-sync quality, and naturalness, along with support for open-ended text guidance and generalization to out-of-domain appearances. The work highlights the potential of natural language interfaces for rich, interactive avatar animation and lays groundwork for future improvements in out-of-domain robustness and multi-attribute control.

Abstract

Recent talking avatar generation models have made strides in achieving realistic and accurate lip synchronization with the audio, but often fall short in controlling and conveying detailed expressions and emotions of the avatar, making the generated video less vivid and controllable. In this paper, we propose a novel text-guided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and generalizability to the resulting video. Our framework, named InstructAvatar, leverages a natural language interface to control the emotion as well as the facial motion of avatars. Technically, we design an automatic annotation pipeline to construct an instruction-video paired training dataset, equipped with a novel two-branch diffusion-based generator to predict avatars with audio and text instructions at the same time. Experimental results demonstrate that InstructAvatar produces results that align well with both conditions, and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness. Our project page is https://wangyuchi369.github.io/InstructAvatar/.
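The two-branch generator mentioned in the abstract takes audio and text conditions together. The sketch below illustrates one way such conditioning could be wired, following only what the Figure 4 caption states: the audio feature is added element-wise to the noisy motion latent, and two cross-attention branches (emotion and motion) inject text control. Everything else is an assumption made for illustration: the function names, the single-head attention without learned projections, and the scalar gates initialised to zero, which stand in for the zero convolution named in the section list and are not the paper's actual layers.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vector, tokens: List[Vector]) -> Vector:
    """Single-head dot-product attention of one query over text tokens.

    Hypothetical simplification: no learned query/key/value projections.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    return [
        sum(w * tok[i] for w, tok in zip(weights, tokens))
        for i in range(len(query))
    ]


def condition_latent(
    noisy_latent: Vector,
    audio_feature: Vector,
    emotion_tokens: List[Vector],
    motion_tokens: List[Vector],
    emotion_gate: float = 0.0,  # assumed zero-initialised gate
    motion_gate: float = 0.0,   # assumed zero-initialised gate
) -> Vector:
    """Add audio element-wise, then inject two gated text branches."""
    # Audio-aware input: element-wise addition (per the Figure 4 caption).
    x = [z + a for z, a in zip(noisy_latent, audio_feature)]
    # Two-branch cross-attention: one branch per instruction type.
    emotion_out = cross_attend(x, emotion_tokens)
    motion_out = cross_attend(x, motion_tokens)
    # With both gates at zero the text branches contribute nothing.
    return [
        xi + emotion_gate * e + motion_gate * m
        for xi, e, m in zip(x, emotion_out, motion_out)
    ]
```

With both gates at zero the output equals the audio-conditioned latent, so the text branches only start to influence the result once the gates move away from zero. That is the usual motivation for zero-initialised conditioning paths.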

Paper Structure (51 sections, 7 equations, 11 figures, 11 tables)

Introduction
Related Works
Methodology
Overview
Construct Natural and Diverse Text Instructions
Emotion Label Extension
Action Unit Extraction
MLLM Paraphrase
Text-Guided Motion Generator
Basics for Diffusion Models
Audio-aware Input Block
Two-branch Text-aware Denoising Block
Zero Convolution Mechanism for Text Condition
Training and Inference Pipelines
...and 36 more sections

Figures (11)

Figure 1: InstructAvatar enables emotional talking face generation through a flexible natural language interface (top 2 rows). The generated results exhibit fine-grained expression control, excellent identity preservation, high-quality lip sync, and natural character movements. Moreover, it supports direct control of facial motion and expression without relying on audio cues, a feature absent in previous studies (bottom 2 rows). The ability of InstructAvatar to handle highly out-of-domain appearances (like cartoons, sketches, and sculptures) further highlights its generalization capabilities.
Figure 2: Method overview. InstructAvatar consists of two components: a VAE $\mathcal{H}$ that disentangles motion information from the video, and a motion generator $\mathcal{G}$ that generates the motion latent conditioned on audio and instruction. Because there are two types of data, two switches are placed on the instruction and audio inputs. During inference, the motion encoder of the VAE is dropped and we iteratively denoise Gaussian noise to obtain the predicted motion latent. The VAE decoder then generates the resulting video from this latent and the user-provided portrait.
Figure 3: We extend the emotion label using a predefined template and incorporate intensity information by modifying the emotion with an adverb representing the degree. For fine-grained control, we extract the AUs and then prompt GPT-4V to paraphrase them into a sentence.
Figure 4: Details of the motion generator. To denoise the noisy motion latent, we first add the audio feature to it element-wise. Then, in each denoising block, we design a two-branch cross-attention module to inject emotion and motion control into the model. Lastly, we incorporate AU and intensity losses to encourage the model to learn them.
Figure 5: Qualitative comparison with baselines. InstructAvatar achieves good lip-sync quality and emotion controllability. Additionally, the outputs generated by our model exhibit enhanced naturalness and effectively preserve identity characteristics.
...and 6 more figures
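
The Figure 3 caption describes two ways of building text instructions: extending an emotion label with a template plus an intensity adverb, and paraphrasing extracted AUs into a sentence with GPT-4V. A minimal sketch of both steps is given below. The template wording, the adverb mapping, the AU names, and the function names are all hypothetical, and the GPT-4V call itself is left out: the second function only assembles a prompt string.

```python
from typing import Dict, Sequence

# Assumed mapping from an intensity level to a degree adverb (illustrative only).
INTENSITY_ADVERBS: Dict[int, str] = {1: "slightly", 2: "moderately", 3: "very"}


def extend_emotion_label(emotion: str, intensity: int) -> str:
    """Fill a hypothetical template with an emotion and a degree adverb."""
    adverb = INTENSITY_ADVERBS[intensity]
    return f"Talk with a {adverb} {emotion} expression."


def build_paraphrase_prompt(action_units: Sequence[str]) -> str:
    """Assemble a hypothetical prompt asking an MLLM to paraphrase AUs."""
    au_text = ", ".join(action_units)
    return (
        "Paraphrase the following facial action units into one natural "
        f"sentence describing the facial motion: {au_text}."
    )


if __name__ == "__main__":
    print(extend_emotion_label("happy", 3))
    print(build_paraphrase_prompt(["cheek raiser", "lip corner puller"]))
```

Keeping the coarse emotion instruction and the fine-grained AU instruction as separate strings mirrors the caption's distinction between emotion control and fine-grained motion control.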
