Beyond Generation: Unlocking Universal Editing via Self-Supervised Fine-Tuning
Harold Haodong Chen, Harry Yang, Ser-Nam Lim
TL;DR
The paper tackles the limited generality and high cost of existing video editing methods by introducing UES, a self-supervised fine-tuning framework that turns text(+image)-to-video diffusion models into unified generation-editing systems via dual conditioning on the reference video and caption. It leverages a lightweight LoRA-based adaptation and a CLIP-derived video encoding with a dual-path cross-attention scheme to learn intrinsic text-video semantic correspondence, enabling versatile edits guided by delta prompts or full captions. The authors also introduce OmniBench-99, a diverse 99-video benchmark spanning four editing types and eight scenarios to systematically evaluate universal editing. Experimental results show that UES enhances generation quality while granting powerful, generalizable editing capabilities without extra supervision, achieving substantial parameter efficiency and broad applicability to text(+image)-to-video models.
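The dual-path cross-attention idea mentioned above can be illustrated with a minimal sketch. It assumes a decoupled design in which the same queries attend separately to caption tokens and to reference-video tokens, and the two results are summed. The function names, the `video_scale` weight, and the omission of learned query/key/value projections are all simplifications made for illustration, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention on plain Python lists (no learned projections)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim_out = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_out)]
        )
    return outputs

def dual_path_cross_attention(queries, text_tokens, video_tokens, video_scale=1.0):
    """Hypothetical dual-path scheme: one attention path per condition, then a weighted sum."""
    text_out = attend(queries, text_tokens, text_tokens)     # caption path
    video_out = attend(queries, video_tokens, video_tokens)  # reference-video path
    return [
        [t + video_scale * v for t, v in zip(t_row, v_row)]
        for t_row, v_row in zip(text_out, video_out)
    ]

# Toy example: one latent query, one caption token, one video token.
result = dual_path_cross_attention(
    queries=[[1.0, 0.0]],
    text_tokens=[[2.0, 3.0]],
    video_tokens=[[1.0, 1.0]],
    video_scale=0.5,
)
```

Setting `video_scale` to zero in this sketch reduces it to ordinary text-only cross-attention, which is one way a single model could serve both generation and editing.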
Abstract
Recent advances in video generation have outpaced progress in video editing, which remains constrained by three factors: (a) a dependency on supervision that severely limits generality, (b) an unnecessary, artificial separation between the generation and editing tasks, and (c) the high computational cost of training a video model. In this work, we propose UES (Unlocking Universal Editing via Self-Supervision), a lightweight self-supervised fine-tuning strategy that transforms generation models into unified generation-editing systems through self-supervised semantic alignment. Our approach establishes a dual-conditioning mechanism in which original video-text pairs jointly provide visual and textual semantics, enabling structured learning of intrinsic spatiotemporal correspondences. Key advantages include: (i) universality, through supervision-free adaptation to diverse editing tasks; (ii) unification of generation and editing, applicable to most text(+image)-to-video models; and (iii) efficiency, via lightweight fine-tuning that reduces tunable parameters by 92.67%. To enable systematic evaluation, we introduce OmniBench-99, a comprehensive benchmark of 99 videos spanning humans/animals, environments, and objects, and covering 4 editing types and 8 scenarios. Extensive experiments show that UES enables models without inherent editing capability to perform powerful and universal editing while preserving or even enhancing their original generation performance.
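To make the parameter-efficiency claim concrete, the sketch below shows the standard LoRA arithmetic for a single linear layer: a rank-r adapter trains r × (d_in + d_out) parameters in place of the layer's d_in × d_out weights. The layer size and rank are hypothetical values chosen for illustration. The result is not the 92.67% figure reported above, which depends on the full set of modules tuned in each model.

```python
def full_params(d_in, d_out):
    """Weights in a dense linear layer (bias ignored)."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable weights in a rank-`rank` LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def tunable_reduction(d_in, d_out, rank):
    """Fraction of the layer's weights that no longer need gradients under LoRA."""
    return 1.0 - lora_params(d_in, d_out, rank) / full_params(d_in, d_out)

# Hypothetical layer: a 1024 x 1024 projection adapted with rank 16.
reduction = tunable_reduction(d_in=1024, d_out=1024, rank=16)
print(f"tunable parameters reduced by {reduction:.2%}")
```

For a fixed layer size the reduction shrinks as the rank grows, so the rank is the main knob that trades adapter capacity against training cost.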
