InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem
Yeobin Hong, Suhyeon Lee, Hyungjin Chung, Jong Chul Ye
TL;DR
InverseCrafter reframes controllable 4D video generation as a latent-space inverse problem to avoid costly fine-tuning of pre-trained Video Diffusion Models (VDMs). It introduces a continuous, multi-channel latent mask generated either by a lightweight learned encoder or a training-free projection, enabling data-consistency guidance entirely in the latent space with DDS and CG-based optimization. The approach delivers near-zero inference overhead and strong performance on camera-control and editing-inpainting tasks, while maintaining the original VDM priors. This yields a flexible, general-purpose video inpainting solver that can be used for both precise camera manipulation and text-guided content editing without retraining the model.
Abstract
Recent approaches to controllable 4D video generation often rely on fine-tuning pre-trained Video Diffusion Models (VDMs). This dominant paradigm is computationally expensive, requiring large-scale datasets and architectural modifications, and frequently suffers from catastrophic forgetting of the model's original generative priors. Here, we propose InverseCrafter, an efficient inpainting inverse solver that reformulates the 4D generation task as an inpainting problem solved in the latent space. The core of our method is a principled mechanism to encode the pixel-space degradation operator into a continuous, multi-channel latent mask, thereby bypassing the costly bottleneck of repeated VAE operations and backpropagation. InverseCrafter not only achieves comparable novel-view generation and superior measurement consistency in camera control tasks with near-zero computational overhead, but also excels at general-purpose video inpainting with editing. Code is available at https://github.com/yeobinhong/InverseCrafter.
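
To make the idea of a continuous latent mask concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a hypothetical VAE with 8x spatial downsampling and 4 latent channels, ignores temporal compression, and uses plain average pooling as a stand-in for a training-free projection. The function names `pixel_mask_to_latent_mask` and `latent_data_consistency` are illustrative, and the soft blend shown replaces the DDS and CG-based optimization mentioned above with a much simpler data-consistency step.

```python
import numpy as np


def pixel_mask_to_latent_mask(pixel_mask, spatial_factor=8, channels=4):
    """Project a binary pixel mask (T, H, W) to a continuous latent mask (T, C, h, w).

    Assumption: average pooling over each spatial_factor x spatial_factor patch.
    A latent cell that is only partly observed gets a fractional value in (0, 1).
    """
    t, h, w = pixel_mask.shape
    s = spatial_factor
    if h % s != 0 or w % s != 0:
        raise ValueError("H and W must be divisible by spatial_factor")
    # Split each spatial axis into (blocks, block_size) and average inside blocks.
    pooled = pixel_mask.astype(np.float64).reshape(t, h // s, s, w // s, s).mean(axis=(2, 4))
    # Copy the pooled mask to every latent channel (a learned encoder could
    # instead produce a different mask per channel).
    return np.repeat(pooled[:, None, :, :], channels, axis=1)


def latent_data_consistency(z_generated, z_measured, latent_mask):
    """Soft blend in latent space: keep measured latents where the mask is high."""
    return latent_mask * z_measured + (1.0 - latent_mask) * z_generated


# Toy example: 2 frames of 16x16 pixels, with the left 12 columns observed.
pixel_mask = np.zeros((2, 16, 16))
pixel_mask[:, :, :12] = 1.0
latent_mask = pixel_mask_to_latent_mask(pixel_mask)

z_generated = np.zeros((2, 4, 2, 2))  # placeholder for the current diffusion estimate
z_measured = np.ones((2, 4, 2, 2))    # placeholder for the encoded warped measurement
z_blended = latent_data_consistency(z_generated, z_measured, latent_mask)
```

In this toy case the first latent column is fully observed and the second is half observed, so the blend there mixes the measured and generated latents equally. Because the mask is applied directly to latents, no VAE encode or decode is needed inside the sampling loop, which is the source of the efficiency the abstract describes.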
