VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma

TL;DR

This work tackles the generalization gap in instruction-based video editing caused by reliance on simplistic paired edits. It introduces VIVA, a framework that couples a VLM-based instructor with a diffusion-based editor and a post-training Edit-GRPO stage to improve instruction fidelity, content preservation, and aesthetics. A large synthetic data pipeline and LoRA-based RL optimization enable robust, open-domain edits, including reference-image control, yielding superior results on VIE-Bench and performance competitive with a commercial model. The approach advances controllable, high-quality video editing with strong generalization to complex real-world instructions, with practical implications for flexible, user-guided video editing in media production and personalization tools.

Abstract

Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
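The "relative rewards" used by Edit-GRPO can be illustrated with a minimal sketch of group-relative advantage computation, as in standard Group Relative Policy Optimization. The reward weights, the function names, and the example scores below are hypothetical and chosen only for illustration; the abstract does not specify how the instruction, preservation, and aesthetic rewards are combined.

```python
from statistics import mean, pstdev

def combined_reward(instruction, preservation, aesthetics,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three per-sample scores (weights are an assumption)."""
    w_i, w_p, w_a = weights
    return w_i * instruction + w_p * preservation + w_a * aesthetics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical scores for a group of four edits sampled from one instruction.
scores = [(0.9, 0.8, 0.7), (0.4, 0.9, 0.6), (0.7, 0.5, 0.8), (0.2, 0.3, 0.4)]
rewards = [combined_reward(*s) for s in scores]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within each group, an edit is rewarded only for being better than the other samples drawn for the same instruction, which removes the need for a separately learned value baseline.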

Paper Structure

This paper contains 35 sections, 14 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Example results generated by VIVA in comparison with Runway Gen-4 Aleph [runway2025aleph]. Our method supports instruction-based video editing with an optional reference image as input (shown as the inset teddy bear in the first row). Runway Gen-4 Aleph over-completes the editing instruction, removing both the hand and the cigarette entirely, and fails to preserve the identity of the teddy bear.
  • Figure 2: Overall pipeline of VIVA. A context-aware VLM instructor encodes the system prompt, instruction, first frame of the source video, and an optional reference image into VLM tokens. A trainable token refiner aligns these tokens to the pretrained DiT latent space. The VAE encodings of the source video and optional reference image are added to the noisy latent to form context-aware noise tokens. Finally, the DiT denoises these tokens under VLM guidance to generate the edited video.
  • Figure 3: Overall pipeline of Edit-GRPO. We inject stochasticity via Flow-SDE [liu2025flowgrpo] to generate diverse samples, score them with our reward system, and compute a GRPO loss from the resulting relative advantages to update the model. For efficiency, we optimize a LoRA instead of full fine-tuning.
  • Figure 4: Qualitative comparison of instruction-based video editing on the VIE-Bench [mou2025instructx] dataset. The editing instruction corresponding to each group of results is shown at the bottom.
  • Figure 5: User study results. We conduct 1-to-1 paired comparisons. Win indicates that users prefer our result to the baseline's, and the opposite outcome indicates that they prefer the baseline's. Tie indicates no significant difference. Numbers are reported as percentages.
  • ...and 9 more figures
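The update step in the Figure 3 caption ("compute a GRPO loss from the resulting relative advantages") can be sketched with the clipped surrogate objective commonly used in GRPO-style training. This is a simplified illustration, not the paper's implementation: the per-sample log-probabilities, the clipping range `clip_eps`, and the omission of any KL regularization term are all assumptions.

```python
import math

def grpo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped policy-gradient surrogate averaged over one sampled group.

    logp_new / logp_old: per-sample log-probabilities under the current and
    the sampling policy (hypothetical inputs for this sketch).
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)  # negate: the objective is maximized

# With identical policies the ratio is 1 and the loss is minus the mean advantage.
loss = grpo_clipped_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0])
```

In a LoRA setting such as the one the caption describes, only the low-rank adapter weights would receive gradients from this loss while the pretrained backbone stays frozen.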