Beyond Generation: Unlocking Universal Editing via Self-Supervised Fine-Tuning

Harold Haodong Chen, Harry Yang, Ser-Nam Lim

TL;DR

The paper tackles the limited generality and high cost of existing video editing methods by introducing UES, a self-supervised fine-tuning framework that turns text(+image)-to-video diffusion models into unified generation-editing systems via dual conditioning on the reference video and caption. It leverages a lightweight LoRA-based adaptation and a CLIP-derived video encoding with a dual-path cross-attention scheme to learn intrinsic text-video semantic correspondence, enabling versatile edits guided by delta prompts or full captions. The authors also introduce OmniBench-99, a diverse 99-video benchmark spanning four editing types and eight scenarios to systematically evaluate universal editing. Experimental results show that UES enhances generation quality while granting powerful, generalizable editing capabilities without extra supervision, achieving substantial parameter efficiency and broad applicability to text(+image)-to-video models.
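As a rough illustration of what a dual-path cross-attention scheme could look like, the sketch below attends to text tokens and to video tokens in two separate paths and sums the results. This is not the authors' implementation: the function names, the additive fusion, the `video_scale` weight, and all shapes are assumptions made for illustration, and the random arrays merely stand in for caption embeddings and CLIP-derived video features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_k, w_v):
    """Single-head cross-attention from latent queries to context tokens."""
    keys = context @ w_k
    values = context @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values


def dual_path_cross_attention(queries, text_tokens, video_tokens,
                              text_proj, video_proj, video_scale=1.0):
    """Hypothetical fusion: separate text and video paths, summed.

    Additive fusion and video_scale are assumptions of this sketch,
    not details confirmed by the paper summary.
    """
    text_out = cross_attention(queries, text_tokens, *text_proj)
    video_out = cross_attention(queries, video_tokens, *video_proj)
    return text_out + video_scale * video_out


rng = np.random.default_rng(0)
dim = 8
queries = rng.normal(size=(16, dim))      # latent tokens of the denoiser
text_tokens = rng.normal(size=(5, dim))   # stand-in for caption embeddings
video_tokens = rng.normal(size=(4, dim))  # stand-in for video embeddings
text_proj = (rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim)))
video_proj = (rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim)))

fused = dual_path_cross_attention(
    queries, text_tokens, video_tokens, text_proj, video_proj)
text_only = dual_path_cross_attention(
    queries, text_tokens, video_tokens, text_proj, video_proj,
    video_scale=0.0)
```

In this sketch, setting `video_scale` to zero reduces the layer to ordinary text cross-attention, which is one way a single model could serve both plain generation and reference-conditioned editing.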

Abstract

Recent advances in video generation have outpaced progress in video editing, which remains constrained by several limiting factors: (a) the task's dependency on supervision, which severely limits generality, (b) an unnecessary, artificial separation between the generation and editing tasks, and (c) the high computational cost of training a video model. In this work, we propose UES (Unlocking Universal Editing via Self-Supervision), a lightweight self-supervised fine-tuning strategy that transforms generation models into unified generation-editing systems through self-supervised semantic alignment. Our approach establishes a dual-conditioning mechanism where original video-text pairs jointly provide visual and textual semantics, enabling structured learning of intrinsic spatiotemporal correspondences. Key advantages include: (i) Universality through supervision-free adaptation to diverse editing tasks, (ii) Unification of generation and editing, applicable to most text(+image)-to-video models, and (iii) Efficiency via lightweight fine-tuning that reduces tunable parameters by 92.67%. To enable systematic evaluation, we introduce OmniBench-99, a comprehensive benchmark spanning 99 videos across humans/animals, environments, and objects, comprising 4 editing types and 8 scenarios. Extensive experiments show that UES enables models without inherent editing capability to perform powerful and universal editing while preserving or even enhancing their original generation performance.
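To give a feel for why low-rank adaptation shrinks the tunable parameter count, the sketch below compares a full weight matrix with a LoRA-style pair of low-rank factors. The layer size and rank are hypothetical, and the result is not meant to reproduce the 92.67% figure reported above, which depends on the actual model and on which layers are adapted.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Fraction of a d_out x d_in weight's parameters that a rank-`rank`
    LoRA update (factors of shape d_out x rank and rank x d_in) trains."""
    full_params = d_out * d_in
    lora_params = rank * (d_out + d_in)
    return lora_params / full_params


# Hypothetical layer: a 1024 x 1024 projection adapted with rank 16.
fraction = lora_trainable_fraction(1024, 1024, 16)
reduction_pct = 100 * (1 - fraction)
print(f"trainable fraction: {fraction:.5f}, reduction: {reduction_pct:.3f}%")
```

The reduction grows as the rank shrinks relative to the layer width, so the overall saving for a full model depends on the rank chosen and on how many layers receive adapters.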

Paper Structure

This paper contains 24 sections, 7 equations, 32 figures, 6 tables.

Figures (32)

  • Figure 1: UES: Unlocking Universal Editing via Self-Supervision. We introduce UES, a lightweight self-supervised fine-tuning strategy that enables text(+image)-to-video models to achieve universal editing capabilities without relying on additional supervision. (a) Current T(+I)2V models excel at generation but lack inherent editing capabilities and incur high computational costs for fine-tuning. (b) Existing video editing models often require additional supervision. (c) By applying UES, a unified model is achieved, seamlessly combining generation and editing functionalities. (d) Compared to existing editing models, which are limited to four editing types, UES extends editing capabilities across four types and eight diverse editing scenarios.
  • Figure 2: Potential structures of UES video condition encoding.
  • Figure 3: Toy experiments of the generation and editing capability of the three structures in Fig. 2. TExp. 1 (Struc. A): Over-relies on video conditions, neglecting text input. TExp. 2 (Struc. B): Combines conditions but causes semantic ambiguity. TExp. 3 (Struc. C): Dual-path strategy enables both generation and editing. TExp. 4 (Struc. C w/ Adaptation): Adaptation (Fig. 4) further enhances spatiotemporal learning, e.g., preserving unedited areas.
  • Figure 4: Illustration of UES condition modeling.
  • Figure 5: Illustration of semantic correspondence. (Top) Reference video and its original caption. (Middle) Results using only one condition. (Bottom) Effects of full sentence vs. delta prompt.
  • ...and 27 more figures