RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives

Jaehong Yoon, Shoubin Yu, Mohit Bansal

TL;DR

RACCooN presents a two-stage video-to-paragraph-to-video framework that converts videos into structured, object-centric narratives using multi-granular spatiotemporal pooling, then uses these narratives to drive a unified diffusion-based editing model for remove/add/change tasks. A dedicated VPLM dataset, combined with instructional fine-tuning (LoRA) on video-language backbones, enables high-quality V2P descriptions and accurate P2V edits. The approach achieves superior performance on both V2P captioning tasks and three object-centric editing subtasks, demonstrating versatile, user-friendly video editing without hand-crafted prompts. This framework could significantly broaden accessible, text-driven video editing by leveraging auto-generated, detailed instructions.

Abstract

Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions for input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling various modifications to the input video, such as removing or changing subjects and/or adding new objects. The proposed approach stands out from other methods through several significant contributions: (1) RACCooN introduces a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions, capturing both the broad context and object details without requiring complex human annotations, which simplifies precise text-based video content editing for users. (2) Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. (3) RACCooN also plans to imagine new objects in a given video, so users can simply prompt the model to receive a detailed video editing plan for complex video editing. The proposed framework demonstrates impressive, versatile capabilities in video-to-paragraph generation and video content editing, and can be incorporated into other SoTA video generative models for further enhancement.

Paper Structure

This paper contains 18 sections, 2 equations, 25 figures, 6 tables.

Figures (25)

  • Figure 1: Overview of RACCooN, a versatile and user-friendly video-to-paragraph-to-video framework that enables users to remove, add, or change video content by updating auto-generated narratives.
  • Figure 2: Illustration of RACCooN. RACCooN generates video descriptions from three distinct sets of pooled visual tokens, including Multi-Granular Spatiotemporal (MGS) Pooling. Next, users can edit the generated descriptions by adding, removing, or modifying words to create new videos. Note that for object-addition tasks, if users do not provide layout information for the objects they want to add, RACCooN can predict the target layout in each frame.
  • Figure 3: Illustration of MGS pooling. We obtain MGS pooling tokens using a spatiotemporal mask $\bm{m}$ via overlapping k-means clustering (OKM) of averaged superpixel features $\bar{\bm{S}}$.
  • Figure 4: Qualitative Comparison between RACCooN and other baselines. Baseline names are abbreviated: VC: VideoComposer, I-A: Inpainting Anything, TF: TokenFlow. We underlined visual details in our caption. More visualizations are in the supplementary material.
  • Figure 5: Qualitative V2P example of our RACCooN on Sora video.
  • ...and 20 more figures
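
As a rough illustration of the MGS pooling idea described in the Figure 3 caption, the following is a minimal sketch, not the authors' implementation. It assumes the averaged superpixel features are already available as a NumPy array, and it stands in for overlapping k-means with a simple rule: plain k-means followed by assigning each superpixel to every cluster whose center is within a fixed ratio of its nearest-center distance. The function names, the `overlap_ratio` parameter, and the masked mean pooling step are all assumptions made for this example.

```python
import numpy as np


def overlapping_kmeans_mask(feats, k, n_iter=10, overlap_ratio=1.2, seed=0):
    """Cluster feats (n, d) and return a boolean mask of shape (k, n).

    mask[c, i] is True when item i is assigned to cluster c. An item may
    belong to several clusters (hypothetical overlap rule, see lead-in).
    """
    rng = np.random.default_rng(seed)
    n = feats.shape[0]
    # Initialise centers from k distinct items.
    centers = feats[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Pairwise distances between items and centers: shape (n, k).
        dists = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
        hard = dists.argmin(axis=1)
        for c in range(k):
            members = feats[hard == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    dists = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
    nearest = dists.min(axis=1, keepdims=True)
    # Overlap: accept every cluster nearly as close as the nearest one.
    mask = dists <= overlap_ratio * nearest + 1e-8
    return mask.T


def masked_mean_pool(feats, mask):
    """Average the features selected by each mask row: (k, n) x (n, d) -> (k, d)."""
    weights = mask.astype(feats.dtype)
    counts = weights.sum(axis=1, keepdims=True)
    return (weights @ feats) / np.maximum(counts, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    superpixel_feats = rng.normal(size=(20, 8))  # stand-in for averaged superpixel features
    m = overlapping_kmeans_mask(superpixel_feats, k=4)
    tokens = masked_mean_pool(superpixel_feats, m)
    print(m.shape, tokens.shape)
```

Each row of the mask selects one (possibly overlapping) group of superpixels, and each pooled row plays the role of one pooled token in this sketch. How the actual model builds the mask and feeds the pooled tokens to the language backbone is described in the paper itself.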