Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations

Yiqing Shen, Chenjia Li, Mathias Unberath

TL;DR

This paper tackles the challenge of text-driven video editing when user queries are implicit and require multi-hop reasoning to identify editing targets. It introduces RIVER, a framework that decouples reasoning from generation by constructing a digital twin representation of video content, enabling an LLM to reason over semantic, spatial, and temporal relationships before instructing a diffusion-based editor. The approach is trained with reinforcement learning (GRPO) using rewards that enforce correct reasoning and high-quality edits, and it is evaluated on RVEBenchmark as well as VegGIE and FiVE, where it achieves state-of-the-art results. The work demonstrates the utility of digital twin representations as an intermediate reasoning layer bridging perception and pixel-level video synthesis, offering robust handling of implicit queries and precise edits with potential impact on content creation workflows.

Abstract

Text-driven video editing enables users to modify video content using only text queries. Existing methods can modify video content when given explicit descriptions of editing targets with precise spatial locations and temporal boundaries, but these requirements become impractical when users express edits through implicit queries that reference semantic properties or object relationships. We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications, and a first model attempting to solve this complex task, RIVER (Reasoning-based Implicit Video Editor). RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. A large language model processes this representation jointly with the implicit query, performs multi-hop reasoning to determine modifications, and outputs structured instructions that guide a diffusion-based editor to execute pixel-level changes. RIVER is trained using reinforcement learning with rewards that evaluate reasoning accuracy and generation quality. Finally, we introduce RVEBenchmark, a benchmark of 100 videos with 519 implicit queries spanning three levels and categories of reasoning complexity, designed specifically for reasoning video editing. RIVER achieves the best performance on the proposed RVEBenchmark and also achieves state-of-the-art performance on two additional video editing benchmarks (VegGIE and FiVE), where it surpasses six baseline methods.
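To make the idea of multi-hop reasoning over a digital twin representation concrete, the following is a minimal sketch and not the paper's actual data format. It assumes a hypothetical record type `TwinObject` holding a category, attributes, and per-frame bounding boxes, and resolves an illustrative implicit query ("the dog to the left of the person in red") in two hops. All names, fields, and the query itself are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TwinObject:
    """One tracked object in a hypothetical digital twin record."""
    obj_id: int
    category: str
    attributes: dict = field(default_factory=dict)
    # frame index -> (x_min, y_min, x_max, y_max) in pixels
    boxes: dict = field(default_factory=dict)


def center_x(box):
    """Horizontal center of a bounding box."""
    return (box[0] + box[2]) / 2.0


def left_of(a, b, frame):
    """True if a's box center lies left of b's in the given frame."""
    if frame not in a.boxes or frame not in b.boxes:
        return False
    return center_x(a.boxes[frame]) < center_x(b.boxes[frame])


def resolve_target(objects, frame):
    """Two-hop lookup: find the person in red, then dogs to their left."""
    # Hop 1 (semantic): select the anchor by category and attribute.
    anchors = [o for o in objects
               if o.category == "person" and o.attributes.get("color") == "red"]
    if not anchors:
        return []
    anchor = anchors[0]
    # Hop 2 (spatial): select dogs whose center is left of the anchor.
    return [o.obj_id for o in objects
            if o.category == "dog" and left_of(o, anchor, frame)]


# Illustrative scene with made-up coordinates.
scene = [
    TwinObject(0, "person", {"color": "red"}, {0: (200, 50, 260, 300)}),
    TwinObject(1, "person", {"color": "blue"}, {0: (400, 50, 460, 300)}),
    TwinObject(2, "dog", {}, {0: (100, 200, 180, 300)}),
    TwinObject(3, "dog", {}, {0: (300, 200, 380, 300)}),
]
targets = resolve_target(scene, frame=0)
print(targets)
```

In this sketch the query never names the target object directly; the target is only identified after the semantic hop and the spatial hop are chained, which is the kind of inference the abstract attributes to the LLM.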

Paper Structure

This paper contains 25 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of video editing paradigms. (a) Diffusion-based methods directly process explicit queries through conditional generation but struggle when editing targets require inference. (b) Agent-based approaches employ MLLMs for planning and execution but still depend on explicit specifications of what and where to edit. (c) Our proposed RIVER framework handles implicit queries by first constructing a digital twin representation where an LLM performs multi-hop reasoning to identify editing targets, then guides the diffusion model to execute modifications.
  • Figure 2: Overview of the RIVER framework. Green: Digital twin representation construction extracts structured scene information from video frames using specialist vision models. Orange: The GRPO training workflow updates the LLM through reward signals that validate output format correctness ($R_{\text{token}}$), code executability ($R_{\text{exec}}$), edited representation validity ($R_{\text{dt}}$), and generation performance ($R_{\text{perf}}$). Blue: The inference workflow processes implicit queries through four stages: the LLM performs reasoning analysis (<think>), generates executable code for spatial-temporal verification (<execute>), evaluates computation results (<results>), and produces edited digital twin representations (<edit>) that guide the diffusion model to synthesize the final edited video.
  • Figure 3: Overview of RVEBenchmark dataset structure and examples. (a) Distribution of 519 queries across three difficulty levels, where each level combines different reasoning categories. (b) Token distribution reveals that Level 3 queries contain the most tokens (61.1% for semantic, 54.3% for spatial, 78.7% for temporal), reflecting their increased complexity. (c) Four representative examples demonstrate progressive reasoning difficulty.
  • Figure 4: Qualitative comparison of RIVER against six baseline methods on RVEBenchmark. The implicit query requires multi-hop reasoning to identify the target subject based on visual attributes, spatial relationships, and semantic understanding. The bottom three rows show the original digital twin representation, the edited digital twin representation from the LLM, and RIVER's final edited video frames.
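
The Figure 2 caption mentions a format reward over the four tagged stages and GRPO training. The sketch below illustrates both ideas under stated assumptions: `format_reward` is a hypothetical stand-in for a format check (the closing tags such as `</think>` and the binary 0/1 score are assumptions, not details from the paper), and `group_advantages` applies the group-relative normalization that GRPO uses, subtracting the group mean reward and dividing by the group standard deviation. The use of the population standard deviation and the `eps` term are implementation choices made here.

```python
import re
import statistics

# Stage tags named in the Figure 2 caption.
STAGES = ("think", "execute", "results", "edit")


def format_reward(response):
    """Return 1.0 if all four tagged stages appear exactly once, in order."""
    for tag in STAGES:
        # Assumed convention: each stage is closed with a matching </tag>.
        if response.count(f"<{tag}>") != 1 or response.count(f"</{tag}>") != 1:
            return 0.0
    body = r"\s*".join(rf"<{tag}>.*?</{tag}>" for tag in STAGES)
    pattern = r"\s*" + body + r"\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative group of four sampled responses to the same query.
good = ("<think>find the dog</think><execute>left_of(a, b)</execute>"
        "<results>True</results><edit>remove obj 2</edit>")
bad = "<think>find the dog</think><edit>remove obj 2</edit>"

rewards = [format_reward(r) for r in (good, bad, bad, good)]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

Responses that follow the four-stage format receive positive advantages relative to the group, and malformed ones receive negative advantages. A full reward would also combine the executability, representation validity, and generation performance terms named in the caption; how those are weighted is not specified here and is left out of the sketch.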