In-Context Sync-LoRA for Portrait Video Editing

Sagi Polaczek, Or Patashnik, Ali Mahdavi-Amiri, Daniel Cohen-Or

TL;DR

This work introduces Sync-LoRA, a diffusion-based framework for frame-accurate portrait video editing built on in-context LoRA fine-tuning. Training uses automatically generated video pairs that share the same motion but differ in appearance, filtered with multi-channel motion cues (speech, gaze, blink, and pose), so the model learns to propagate the local appearance changes defined by the edited first frame while preserving source motion and identity. The approach combines a transformer-based image-to-video backbone with 3D rotary positional embeddings and a rectified flow loss, and it handles diverse identities and edit types, including background changes and expression modification. Experiments report better temporal coherence, edit fidelity, and identity preservation than state-of-the-art baselines, supported by ablations and user studies. The method offers a practical route to high-fidelity, temporally consistent portrait video edits.
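
For readers unfamiliar with the rectified flow loss mentioned above, its standard textbook form is sketched below. The notation is an assumption made here for illustration ($x_0$ for the clean target-video latent, $\epsilon$ for Gaussian noise, $t$ for the time step, $c$ for the conditioning, $v_\theta$ for the network's velocity prediction); the paper's own equations may use different symbols or weighting.

$$
x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad \mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}\,\big\| v_\theta(x_t, t, c) - (\epsilon - x_0) \big\|_2^2, \qquad \epsilon \sim \mathcal{N}(0, I),\; t \in [0, 1]
$$

In this reading, $c$ would stand for the source video $S$, the edited first frame $I$, and the edit prompt $P$, with noise applied only to the edited branch.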

Abstract

Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject's original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. We present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. To enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.
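
The synchronization-based filtering described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the function names (`pearson`, `sync_score`, `filter_pairs`), the use of Pearson correlation as the alignment measure, the worst-channel aggregation, and the `keep_fraction` parameter are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Signal = Sequence[float]
# (pair id, per-channel signals of the source video, per-channel signals of the edited video)
Pair = Tuple[str, Dict[str, Signal], Dict[str, Signal]]


def pearson(a: Signal, b: Signal) -> float:
    """Pearson correlation of two equal-length signals; 0.0 if either is constant."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("signals must have the same length (at least 2 samples)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def sync_score(source: Dict[str, Signal], edited: Dict[str, Signal]) -> float:
    """Assumed aggregation: the worst per-channel correlation decides the score."""
    return min(pearson(source[name], edited[name]) for name in source)


def filter_pairs(pairs: List[Pair], keep_fraction: float = 0.5) -> List[str]:
    """Return the ids of the best-synchronized pairs, highest score first."""
    ranked = sorted(pairs, key=lambda p: sync_score(p[1], p[2]), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return [pair_id for pair_id, _, _ in ranked[:keep]]
```

Taking the minimum over channels is one possible design choice: a pair is rejected as soon as any single cue drifts, which matches the observation that each cue matters, but a weighted mean would be an equally plausible alternative.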

Paper Structure

This paper contains 39 sections, 8 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Overview of Sync-LoRA. Given a source video $S$, an edited first frame $I$, and an edit prompt $P$, Sync-LoRA denoises the target video $T$ conditioned on these inputs. During training, only the edited branch is noised, while the source branch stays clean and provides motion and identity cues through shared attention, so the model copies motion from $S$ and propagates the local edit across all frames.
  • Figure 2: Data generation and curation pipeline. Our process constructs synchronized video pairs for Sync-LoRA training. (Top) Portrait images are generated, edited, and converted into side-by-side talking-head videos. (Middle) Facial and pose landmarks yield motion signals for speech, gaze, blink, and pose. (Bottom) Pairs are scored and filtered by synchronization quality, keeping only the most aligned examples for training.
  • Figure 3: Synchronization signal visualization. Two synchronization cues used in our filtering process. Top: Eye landmarks are used to compute the Eye Aspect Ratio (EAR). Note how the plotted peaks (reference in green, edited in orange) correspond directly to the blink event shown in the frames above. Bottom: Upper-body pose landmarks are used to track the right elbow angle. The plots again show tightly correlated motion, confirming the arm movement is synchronized across both videos.
  • Figure 4: Comparison of portrait video editing methods. The rows show the source video and results from LucyEdit, VACE, AnyV2V, FlowEdit, and Sync-LoRA (Ours). The columns depict different temporal positions. Our method, VACE, AnyV2V, and FlowEdit utilize the same edited first frame as visual input, whereas the text-based LucyEdit operates from text guidance alone.
  • Figure 5: Necessity of all synchronization cues. Each column shows results when training without the specified motion cue (pose, gaze, speech, or blink) from the filtering stage, compared to our full setup. The source video is shown on the left. Omitting any cue causes motion drift or misalignment across frames.
  • ...and 7 more figures
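
The two signals named in the Figure 3 caption, the Eye Aspect Ratio (EAR) and the elbow angle, can be computed from 2D landmarks roughly as follows. This is a sketch under stated assumptions: the six-point eye layout follows the common EAR definition (corners at `p1` and `p4`, upper lid at `p2` and `p3`, lower lid at `p6` and `p5`), and the function names and landmark ordering are hypothetical, not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def eye_aspect_ratio(eye: Sequence[Point]) -> float:
    """EAR = (|p2 - p6| + |p3 - p5|) / (2 * |p1 - p4|) for six eye landmarks."""
    p1, p2, p3, p4, p5, p6 = eye
    horizontal = math.dist(p1, p4)
    if horizontal == 0:
        raise ValueError("eye corners coincide; cannot compute EAR")
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    return vertical / (2.0 * horizontal)


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees at joint b, e.g. shoulder (a), elbow (b), wrist (c)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("joint coincides with a neighbouring landmark")
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
```

Evaluating either function on every frame yields a one-dimensional signal per video, and the source and edited signals can then be compared frame by frame to judge how closely the two videos are synchronized.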