ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA

Aviad Dahan, Moran Yanuka, Noa Kraicer, Lior Wolf, Raja Giryes

TL;DR

ID-LoRA (Identity-Driven In-Context LoRA) jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. To the authors' knowledge, it is the first method to personalize visual appearance and voice in a single generative pass.

Abstract

Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. On cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
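The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the choice of positions -num_ref..-1 for reference tokens, and the exact guidance formula (the standard classifier-free guidance extrapolation applied to predictions with and without the reference) are all assumptions made for illustration, since the abstract does not give these details.

```python
from typing import List, Sequence


def temporal_positions(num_ref: int, num_target: int) -> List[int]:
    """Temporal indices for the sequence [reference tokens | target tokens].

    Assumption: reference tokens take positions -num_ref..-1 and target
    tokens take 0..num_target-1. The paper's exact offsets may differ.
    """
    ref = list(range(-num_ref, 0))    # disjoint from targets, internal order kept
    target = list(range(num_target))  # usual non-negative positions
    return ref + target


def identity_guidance(pred_with_ref: Sequence[float],
                      pred_without_ref: Sequence[float],
                      scale: float) -> List[float]:
    """CFG-style extrapolation away from the reference-free prediction.

    Assumption: without + scale * (with - without), so scale = 1 returns
    the reference-conditioned prediction unchanged and scale > 1
    amplifies whatever the reference contributes.
    """
    return [wo + scale * (w - wo)
            for w, wo in zip(pred_with_ref, pred_without_ref)]


if __name__ == "__main__":
    print(temporal_positions(num_ref=3, num_target=4))
    # [-3, -2, -1, 0, 1, 2, 3]
    print(identity_guidance([1.0, 2.0], [0.5, 2.5], scale=2.0))
    # [1.5, 1.5]
```

In this sketch the reference and target index ranges never overlap, yet the reference tokens keep their relative order, which mirrors the "disjoint RoPE region while preserving their internal temporal structure" described above. In a real model the guidance step would operate on the denoiser's output tensors at each sampling step rather than on Python lists.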

Paper Structure

This paper contains 65 sections, 5 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Unified Audio-Visual Personalization with ID-LoRA. ID-LoRA takes a reference audio clip and a target first frame as input. Unlike cascaded pipelines that treat modalities separately, ID-LoRA jointly generates synchronized video and audio in a single pass. This unified approach allows a text prompt to simultaneously dictate novel content, such as a specific speaking style and environmental acoustics (e.g., "a jackhammer is drilling in the background"), while ensuring the subject's vocal identity and visual likeness are preserved across the generated sequence. Video examples with audio are available in the supplementary material.
  • Figure 2: ID-LoRA (overview). Target first frame and reference audio are encoded into latents, concatenated with noisy targets, and fed to a shared LTX-2 DiT adapted with In-Context LoRA. Reference audio tokens receive negative temporal positions in the RoPE space, cleanly separating them from the target tokens. Joint text conditioning yields synchronized, identity-preserving video and audio whose acoustics follow the prompt and are adapted to the new environment.
  • Figure 3: Human evaluation results: A/B preference rates (%) on the hard (cross-video) split. Annotators on AMT evaluated 35 pairs across 8 speakers along three axes: voice similarity, environment sounds, and speech manners.
  • Figure 4: Examples from the MOS evaluation set. Each column shows two generated videos from the same speaker under different environmental sound conditions.
  • Figure 5: Environment sound interaction MOS study: per-scenario mean opinion scores (1–5) with 95% confidence intervals for ID-LoRA and Kling 2.6 Pro across ten physical interaction scenarios. ID-LoRA scores higher on 8 of 10 scenarios and in overall score.
  • ...and 2 more figures