AutoVFX: Physically Realistic Video Editing from Natural Language Instructions

Hao-Yu Hsu, Zhi-Hao Lin, Albert Zhai, Hongchi Xia, Shenlong Wang

TL;DR

AutoVFX aims to make physically plausible VFX creation accessible by translating natural language prompts into executable editing programs. Its pipeline has three parts: neural scene modeling, LLM-driven program synthesis, and physics-based simulation and rendering. It builds a holistic scene representation (geometry via BakedSDF, appearance via Gaussian Splatting and textured meshes, semantics via open-vocabulary segmentation, and lighting via HDR maps) to support a broad range of edits, from object insertion and removal to dynamic simulations. A modular library of VFX functions, orchestrated by GPT-4-generated programs, enables flexible edits, which the authors validate through qualitative and quantitative experiments, including user studies. The results indicate that AutoVFX surpasses state-of-the-art baselines in instruction alignment, realism, and physical plausibility, while integrating with traditional rendering toolchains such as Blender.

Abstract

Modern visual effects (VFX) software has made it possible for skilled artists to create imagery of virtually anything. However, the creation process remains laborious, complex, and largely inaccessible to everyday users. In this work, we present AutoVFX, a framework that automatically creates realistic and dynamic VFX videos from a single video and natural language instructions. By carefully integrating neural scene modeling, LLM-based code generation, and physical simulation, AutoVFX is able to provide physically-grounded, photorealistic editing effects that can be controlled directly using natural language instructions. We conduct extensive experiments to validate AutoVFX's efficacy across a diverse spectrum of videos and instructions. Quantitative and qualitative results suggest that AutoVFX outperforms all competing methods by a large margin in generative quality, instruction alignment, editing versatility, and physical plausibility.

Paper Structure

This paper contains 55 sections, 18 figures, 2 tables, 1 algorithm.

Figures (18)

  • Figure 1: AutoVFX takes a video and language instructions as input, and automatically generates programs to produce visual effects and render a new video according to the instructions. It can modify appearance and geometry, enable dynamic interactions, apply particle effects, and even insert animated characters, producing results that are photorealistic, physically-plausible, and easily controllable.
  • Figure 2: AutoVFX framework. Our instruction-guided video editing framework consists of three main modules: (1) 3D Scene Modeling (left), which integrates 3D reconstruction and scene understanding models; (2) Program Generation (middle), where LLMs generate editing programs based on user instructions; and (3) VFX Modules (right), which include predefined functions specialized for various editing tasks. These components are integrated with a physically-based simulation and rendering engine (e.g., Blender) to generate the final video.
  • Figure 3: Program generation. The LLM generates the editing program through in-context learning. With provided context and examples, it learns to call VFX modules and, given unseen user instructions (blue block), generates the program (orange block).
  • Figure 4: Dynamic VFX video editing using AutoVFX. Our approach enables physical interaction, articulated animation, particle effects, insertion of generated 3D assets, material editing, and geometry fracturing.
  • Figure 5: Qualitative comparison on static editing.
  • ...and 13 more figures
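
The program-generation step described in the Figure 3 caption (context and examples first, then an unseen user instruction) can be illustrated with a minimal prompt-assembly sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function `build_prompt`, the `Example` class, and the module names `insert_object` and `remove_object` are all hypothetical, and the actual call to an LLM is left out.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One in-context demonstration: an instruction and its editing program."""
    instruction: str
    program: str


def build_prompt(module_docs, examples, instruction):
    """Assemble an in-context prompt.

    Order: module descriptions, worked examples, then the new instruction
    with an open "Program:" slot for the LLM to complete.
    """
    parts = ["# Available VFX modules (hypothetical names)"]
    parts += [f"- {doc}" for doc in module_docs]
    for ex in examples:
        parts.append(f"\nInstruction: {ex.instruction}\nProgram:\n{ex.program}")
    parts.append(f"\nInstruction: {instruction}\nProgram:")
    return "\n".join(parts)


# Hypothetical module descriptions; the paper's real module library differs.
MODULE_DOCS = [
    "insert_object(scene, name): add a 3D asset to the scene",
    "remove_object(scene, name): delete a segmented object",
]

# A single hypothetical demonstration pair.
EXAMPLES = [
    Example("Remove the vase.", "remove_object(scene, 'vase')"),
]

prompt = build_prompt(MODULE_DOCS, EXAMPLES, "Drop a basketball on the table.")
print(prompt)
```

The string returned here would be sent to the LLM, whose completion after the final `Program:` plays the role of the generated editing program (the orange block in Figure 3).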