LPM 1.0: Video-based Character Performance Model

Ailing Zeng, Casper Yang, Chauncey Ge, Eddie Zhang, Garvey Xu, Gavin Lin, Gilbert Gu, Jeremy Pi, Leo Li, Mingyi Shi, Sheng Bi, Steven Tang, Thorn Hang, Tobey Guo, Vincent Li, Xin Tong, Yikang Li, Yuchen Sun, Yue Zhao, Yuhan Lu, Yuwei Li, Zane Zhang, Zeshi Yang, Zi Ye

Abstract

Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.
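To make the full-duplex interaction pattern concrete, the sketch below illustrates the streaming loop the abstract describes: the character listens and reacts while user audio is active, and speaks with lip sync when agent audio (e.g., from a TTS system) is present. Everything here, including the OnlineLPM class, its step method, and the chunk sizes, is a hypothetical illustration of the described behavior, not the paper's released code.

```python
# Hypothetical sketch of the full-duplex streaming loop; names and shapes
# are illustrative assumptions, not the actual LPM 1.0 API.
import numpy as np


class OnlineLPM:
    """Stand-in for the distilled causal streaming generator (Online LPM)."""

    def __init__(self, character_image, reference_images, prompt=""):
        # Identity-aware conditioning: one character image plus
        # multi-granularity references (global appearance, multi-view body,
        # facial expressions), mirroring the paper's data design.
        self.character_image = character_image
        self.references = reference_images
        self.prompt = prompt  # text prompt for motion control

    def step(self, user_audio_chunk, agent_audio_chunk):
        """Generate the next video chunk causally from past context.

        When agent_audio_chunk is silent, the character listens and reacts
        to user_audio_chunk; when agent audio is present, the character
        speaks with lip sync. Returns a short chunk of video frames.
        """
        # Placeholder output: a real implementation would run the causal
        # DiT here and decode latents with the VAE decoder.
        num_frames, height, width = 4, 480, 480
        return np.zeros((num_frames, height, width, 3), dtype=np.uint8)


def run_session(model, audio_stream):
    # audio_stream yields (user_chunk, agent_chunk) pairs, e.g. 100 ms each;
    # frames are pushed downstream as they are produced, enabling
    # identity-stable, infinite-length generation.
    for user_chunk, agent_chunk in audio_stream:
        yield model.step(user_chunk, agent_chunk)
```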

Figures (15)

  • Figure 1: LPM 1.0 generates identity-consistent conversational video with synchronized verbal and non-verbal behaviors (speaking, listening, micro-expressions, and natural motion) while maintaining visual fidelity across streaming and long-horizon video generation.
  • Figure 2: Data filtering and classification pipeline across four stages. Raw video is progressively filtered through single-shot extraction, quality filtering and cropping, conversation detection and clipping, and finally captioning with embedding generation to produce high-diversity, semantically rich, and emotionally expressive trainable clips.
  • Figure 3: Illustration of the conversational audio-video data processing pipeline. (1) Tracking and cropping converts multi-person clips into single-person clips; (2) three-state labeling applies a fine-tuned LR-ASD model to produce frame-wise speak/listen/idle states; (3) refinement and audio separation verifies and filters labels, then outputs speaker-only and listener-only audio tracks for retained clips.
  • Figure 4: An example of our proposed multi-granularity identity-aware reference images. For each subject, we extract three complementary reference types from raw videos: (i) a global appearance reference capturing overall identity and background context; (ii) multi-view body references covering one to four viewpoints to provide appearance evidence; (iii) a set of facial expression references spanning one to eight expressive states, enabling faithful reproduction of identity-specific details.
  • Figure 5: Base LPM architecture. Inputs (noise video, the first frame, identity-aware reference images, text, speak audio, and listen audio) are encoded by modality-specific encoders and injected into a stack of DiT blocks via self-attention (visual tokens) and cross-attention (text and audio embeddings). The output video latent is decoded by a VAE decoder to produce the generated video. A minimal sketch of one such block appears after this list.
  • ...and 10 more figures
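Figure 5 describes DiT blocks that apply self-attention over visual tokens and cross-attention to text and audio conditioning. Below is a minimal PyTorch sketch of one such block; the dimensions, head count, pre-norm layout, and MLP design are illustrative assumptions, not the configuration of the 17B Base LPM.

```python
# Minimal sketch of a Figure 5-style DiT block: self-attention over visual
# tokens, then cross-attention to concatenated text/audio embeddings.
# All hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    def __init__(self, dim=1024, heads=16, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, visual_tokens, cond_tokens):
        # Self-attention mixes noise-video, first-frame, and reference tokens.
        h = self.norm1(visual_tokens)
        x = visual_tokens + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention injects text, speak-audio, and listen-audio
        # embeddings into the visual stream.
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond_tokens, cond_tokens,
                                need_weights=False)[0]
        return x + self.mlp(self.norm3(x))


# Shape check with toy sizes: 1 clip, 256 visual tokens, 77 condition tokens.
block = DiTBlock()
out = block(torch.randn(1, 256, 1024), torch.randn(1, 77, 1024))
assert out.shape == (1, 256, 1024)
```

The pre-norm residual layout used here is a common DiT convention; the paper's figure only specifies which modalities enter through self- versus cross-attention, so the ordering of the two attention stages within a block is likewise an assumption.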