Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan

TL;DR

Kling-Avatar tackles the challenge of semantically grounded, long-duration avatar synthesis conditioned on multimodal inputs. It introduces an MLLM Director to ground instructions into a unified blueprint storyline, followed by a cascaded, parallel sub-clip generation framework guided by blueprint keyframes to produce coherent, expressive videos. The authors build a curated data pipeline and a 375-sample benchmark, along with training and inference strategies that preserve identity and lip synchronization. Empirical results show state-of-the-art performance up to 1080p/48fps and strong generalization to open domains, demonstrating the approach's practical potential for digital humans in live settings. Overall, the work establishes a benchmark and a scalable, instruction-grounded paradigm for semantically controlled, high-fidelity avatar synthesis.

Abstract

Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
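The second stage described above (consecutive blueprint keyframes used as first-last frame conditions for sub-clips generated in parallel) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `Frame`, `keyframe_pairs`, `generate_long_video`, and `toy_subclip` are hypothetical names, frames are stood in for by strings, and dropping the shared boundary frame when stitching is an assumption made here for the toy example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Frame = str  # hypothetical stand-in for an image or latent frame

def keyframe_pairs(keyframes: Sequence[Frame]) -> List[Tuple[Frame, Frame]]:
    """Pair consecutive keyframes as (first, last) conditions for sub-clips."""
    if len(keyframes) < 2:
        raise ValueError("need at least two keyframes")
    return list(zip(keyframes[:-1], keyframes[1:]))

def generate_long_video(
    keyframes: Sequence[Frame],
    generate_subclip: Callable[[Frame, Frame], List[Frame]],
    max_workers: int = 4,
) -> List[Frame]:
    """Generate every sub-clip independently, then stitch them in order."""
    pairs = keyframe_pairs(keyframes)
    # Each sub-clip depends only on its own two keyframes, so the calls
    # can run concurrently; pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        clips = list(pool.map(lambda pair: generate_subclip(*pair), pairs))
    video: List[Frame] = []
    for index, clip in enumerate(clips):
        # Assumption: adjacent clips share a boundary keyframe, so skip the
        # first frame of every clip after the first to avoid a duplicate.
        video.extend(clip if index == 0 else clip[1:])
    return video

def toy_subclip(first: Frame, last: Frame) -> List[Frame]:
    """Toy generator: the two conditioning frames plus one in-between frame."""
    return [first, f"{first}->{last}", last]
```

For example, `generate_long_video(["k0", "k1", "k2"], toy_subclip)` stitches two three-frame sub-clips into a five-frame sequence. Because no sub-clip waits on another, wall-clock time in this sketch is bounded by the slowest sub-clip rather than the sum, which is the property the abstract attributes to the parallel architecture.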

Paper Structure

This paper contains 13 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Conditioned on audio, image, and user prompts, Kling-Avatar generates high-fidelity portrait animations through instruction grounding and semantic planning. The results exhibit vivid emotions, rich actions, and precise lip synchronization, while also showing strong generalization to open scenarios such as anime, cartoons, and stylized characters.
  • Figure 2: Benchmark performance of Kling-Avatar against its counterparts in terms of GSB metrics. We achieve superior performance on the overall metric as well as across most sub-dimensions.
  • Figure 3: Illustration of Kling-Avatar’s cascaded generation pipeline. An MLLM Director first interprets multimodal instructions into high-level semantics and tells a storyline. Guided by this global planning, the first stage generates a blueprint video. In the second stage, keyframes are extracted from the blueprint and used as first–last frame conditions for parallel sub-clip generation, refining local details and dynamics to synthesize long-duration videos.
  • Figure 4: Overall GSB evaluation results on our benchmark across various dimensions against OmniHuman-1 and HeyGen.
  • Figure 5: Comparison of lip synchronization between Kling-Avatar and baselines. We produce accurate lip movements for characters across different scenarios.
  • ...and 2 more figures
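
Figures 2 and 4 report GSB (Good/Same/Bad) results, a pairwise human-preference protocol in which each sample is judged better than, on par with, or worse than a baseline. The text here does not give the aggregation formula, so the sketch below assumes one common convention, (G + S) / (B + S); the function name `gsb_score` and the label strings are hypothetical.

```python
from collections import Counter
from typing import Iterable

def gsb_score(judgments: Iterable[str]) -> float:
    """Aggregate pairwise judgments labeled 'G', 'S', or 'B'.

    Assumed convention (not confirmed by the text above): (G + S) / (B + S),
    where a value above 1 means the method is preferred over the baseline.
    """
    counts = Counter(judgments)
    unknown = set(counts) - {"G", "S", "B"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    good, same, bad = counts["G"], counts["S"], counts["B"]
    if bad + same == 0:
        # Every comparison was a win; the ratio is unbounded.
        return float("inf")
    return (good + same) / (bad + same)
```

With two "G", one "S", and one "B" judgment, this convention gives (2 + 1) / (1 + 1) = 1.5.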