Lynx: Towards High-Fidelity Personalized Video Generation

Shen Sang, Tiancheng Zhi, Tianpei Gu, Jing Liu, Linjie Luo

TL;DR

Lynx generates high-fidelity personalized video from a single image by extending a Diffusion Transformer (DiT) with two lightweight adapters: an ID-adapter that injects ArcFace-derived identity tokens via a Perceiver Resampler, and a Ref-adapter that fuses dense VAE features through a frozen reference pathway. Spatio-temporal frame packing and progressive training let the model handle variable video lengths and resolutions while preserving identity and temporal coherence. On a benchmark of 40 subjects and 20 prompts (800 test cases), Lynx shows state-of-the-art identity fidelity, competitive prompt following, and high perceptual video quality, as measured by multiple face recognizers and Gemini-based metrics. Overall, Lynx demonstrates a scalable, fine-tuning-free path to personalized video generation with strong identity preservation, controllability, and realism, and it points toward multi-modal and multi-subject personalization.

Abstract

We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter employs a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning, while the Ref-adapter integrates dense VAE features from a frozen reference pathway, injecting fine-grained details across all transformer layers through cross-attention. Together, these modules enable robust identity preservation while maintaining temporal coherence and visual realism. On a curated benchmark of 40 subjects and 20 unbiased prompts (800 test cases), Lynx demonstrates superior face resemblance, competitive prompt following, and strong video quality, advancing the state of personalized video generation.
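
To make the ID-adapter idea more concrete, the following is a minimal sketch of the core step behind a Perceiver-Resampler-style module: a small set of learned latent queries cross-attends to features derived from a face embedding and returns a fixed number of identity tokens. It is an illustration only, not the authors' implementation. The names `resample_identity`, `cross_attention`, `EMBED_DIM`, `TOKEN_DIM`, and `NUM_TOKENS` are hypothetical, the sizes are toy values rather than the paper's, and splitting the embedding into equal chunks is an assumption made so the queries have more than one input to attend over. A full resampler would also use learned projections, multiple heads and layers, and feed-forward blocks, all omitted here.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output token is a weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def resample_identity(face_embedding, latents, token_dim):
    """Map one flat embedding to len(latents) identity tokens."""
    # Assumption: split the flat vector into equal chunks to form input tokens.
    inputs = [face_embedding[i:i + token_dim]
              for i in range(0, len(face_embedding), token_dim)]
    return cross_attention(latents, inputs, inputs)


random.seed(0)
EMBED_DIM, TOKEN_DIM, NUM_TOKENS = 32, 8, 4  # toy sizes, not the paper's
# Stand-in for an ArcFace embedding; a real one would come from a face recognizer.
face_embedding = [random.gauss(0, 1) for _ in range(EMBED_DIM)]
# Stand-in for learned latent queries; in training these would be parameters.
latents = [[random.gauss(0, 1) for _ in range(TOKEN_DIM)] for _ in range(NUM_TOKENS)]

id_tokens = resample_identity(face_embedding, latents, TOKEN_DIM)
print(len(id_tokens), len(id_tokens[0]))  # 4 8
```

The number of output tokens is set by the number of latent queries, not by the size of the input, which is what makes the resulting identity tokens compact and fixed-length for downstream conditioning.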

Paper Structure

This paper contains 13 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Left: Lynx consistently preserves facial identity with high fidelity, while producing natural motion, coherent lighting, and flexible scene adaptation (input shown at top-left). Right: Lynx demonstrates clear superiority in identity resemblance and perceptual quality, while remaining competitive in motion naturalness compared to other methods.
  • Figure 2: Videos generated from a single input image, showing strong identity preservation across expressive facial expressions (row 3), diverse lighting (rows 1, 4, 5), pose variations (rows 2, 6, 7), and object interactions (row 8).
  • Figure 3: Architecture of Lynx. Built on a DiT-based video foundation model, Lynx introduces two adapter modules that inject identity features through cross-attention.
  • Figure 4: Examples of our augmentation strategies: (a) expression augmentation via X-Nemo [zhao2025x], and (b) portrait relighting via LBM [chadebec2025lbm].
  • Figure 5: Qualitative comparison with baseline methods. Competing methods often exhibit issues such as unrealistic actions (row 1 example 2), copy-pasting effects of background (row 4 example 2) or lighting (row 5 example 2), or poor identity resemblance (row 1 example 1, row 3 example 2). In contrast, Lynx consistently preserves facial identity with high fidelity, while producing natural motion, coherent lighting, and flexible scene adaptation.