InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem

Yeobin Hong, Suhyeon Lee, Hyungjin Chung, Jong Chul Ye

TL;DR

InverseCrafter reframes controllable 4D video generation as a latent-space inverse problem, avoiding the costly fine-tuning of pre-trained Video Diffusion Models (VDMs). It introduces a continuous, multi-channel latent mask, produced either by a lightweight learned encoder or by a training-free projection, so that data-consistency guidance runs entirely in the latent space using DDS and CG-based optimization. The approach adds near-zero computational overhead at inference and performs strongly on camera-control and inpainting-with-editing tasks while preserving the original VDM priors. The result is a general-purpose video inpainting solver that handles both precise camera manipulation and text-guided content editing without retraining the model.

Abstract

Recent approaches to controllable 4D video generation often rely on fine-tuning pre-trained Video Diffusion Models (VDMs). This dominant paradigm is computationally expensive, requiring large-scale datasets and architectural modifications, and frequently suffers from catastrophic forgetting of the model's original generative priors. Here, we propose InverseCrafter, an efficient inpainting inverse solver that reformulates the 4D generation task as an inpainting problem solved in the latent space. The core of our method is a principled mechanism to encode the pixel space degradation operator into a continuous, multi-channel latent mask, thereby bypassing the costly bottleneck of repeated VAE operations and backpropagation. InverseCrafter not only achieves comparable novel view generation and superior measurement consistency in camera control tasks with near-zero computational overhead, but also excels at general-purpose video inpainting with editing. Code is available at https://github.com/yeobinhong/InverseCrafter.
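As a rough illustration of what a latent-space data-consistency step with a continuous mask can look like, the sketch below solves a small regularized least-squares problem with conjugate gradient (CG). The objective, the weight `gamma`, the tensor shapes, and the function names `cg_solve` and `latent_data_consistency` are all assumptions chosen for illustration; they are not taken from the paper or its released code.

```python
import numpy as np


def cg_solve(apply_A, b, x0, n_iter=50, tol=1e-16):
    """Conjugate gradient for A x = b, where A is symmetric positive definite.

    apply_A is a function computing A @ x on arrays of any shape.
    tol is a threshold on the squared residual norm.
    """
    x = x0.copy()
    r = b - apply_A(x)
    p = r.copy()
    rs_old = float(np.vdot(r, r))  # vdot flattens multi-dimensional arrays
    for _ in range(n_iter):
        if rs_old < tol:
            break
        Ap = apply_A(p)
        alpha = rs_old / float(np.vdot(p, Ap))
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = float(np.vdot(r, r))
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


def latent_data_consistency(z0, y, mask, gamma=0.1, n_iter=50):
    """Hypothetical latent-space data-consistency step.

    Minimizes 0.5 * ||mask * (z - y)||^2 + 0.5 * gamma * ||z - z0||^2, where
    z0 is the current latent estimate, y is the latent measurement, and mask
    is a continuous per-channel mask in [0, 1] with the same shape as z0.
    The normal equations are (mask^2 + gamma) z = mask^2 y + gamma z0.
    """
    m2 = mask ** 2

    def apply_A(z):
        return m2 * z + gamma * z

    b = m2 * y + gamma * z0
    return cg_solve(apply_A, b, z0, n_iter=n_iter)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape = (4, 3, 8, 8)  # (channels, frames, height, width), assumed layout
    z0 = rng.standard_normal(shape)
    y = rng.standard_normal(shape)
    mask = rng.uniform(0.0, 1.0, size=shape)
    z = latent_data_consistency(z0, y, mask)
    print("mean |z - z0|:", float(np.abs(z - z0).mean()))
```

Because the mask acts elementwise in this toy setting, the same solution is available in closed form as `(mask**2 * y + gamma * z0) / (mask**2 + gamma)`; CG is used here only to show the iterative pattern, and everything stays in the latent domain with no VAE encode or decode inside the loop.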

Paper Structure

This paper contains 36 sections, 13 equations, 15 figures, 3 tables, 2 algorithms.

Figures (15)

  • Figure 1: Representative video results for camera control ("zoom in," "arc left," "arc right") and inpainting with editing ("goldfish" to "turtle").
  • Figure 2: (a) Prior work naively downsamples the pixel space mask via spatio-temporal interpolation. This process results in a single-channel, binary mask that is uniformly broadcast across all $C$ latent channels, ignoring their distinct feature representations and leading to information loss. (b) InverseCrafter computes a continuous, $C$-channel latent mask, enabling efficient latent-space guidance.
  • Figure 3: Overview of InverseCrafter. (a) ${\mathcal{P}_\phi}$ is trained to project the pixel space degradation operator to the latent domain. (b) During inference, ${\bm{z}}_t$ is optimized at each $t$ to enforce data consistency, Eq. (\ref{eq:latent_prox}), with the latent mask derived from either ${\mathcal{P}_\phi}$ or the training-free alternative.
  • Figure 4: Video camera control results with novel content generation. (Top) "+a grey tree." (Bottom) "+a flower vase."
  • Figure 5: Qualitative comparison of video camera control. Camera trajectories are "arc left" and "zoom in." Our method demonstrates a clear advantage in source consistency and semantically aligned generation. Insets provide magnified views of the regions marked by yellow boxes.
  • ...and 10 more figures
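
The contrast drawn in Figure 2 can be made concrete with a toy sketch. The first function mimics the naive route: downsample a binary pixel mask, threshold it, and broadcast the single result to all $C$ latent channels. The second builds a continuous, per-channel mask from a stand-in linear patch encoder. That second construction is a hypothetical illustration only; it is neither the paper's learned ${\mathcal{P}_\phi}$ nor its training-free projection, and the names, the spatial-only downsampling factor `s`, and the random encoder weights are assumptions.

```python
import numpy as np


def naive_latent_mask(pixel_mask, s, C):
    """Block-average a (T, H, W) binary mask by factor s, threshold it, and
    broadcast the single binary result to all C latent channels."""
    T, H, W = pixel_mask.shape
    blocks = pixel_mask.reshape(T, H // s, s, W // s, s).mean(axis=(2, 4))
    binary = (blocks > 0.5).astype(np.float64)
    return np.broadcast_to(binary[None], (C, T, H // s, W // s)).copy()


def toy_continuous_latent_mask(pixel_mask, enc_weights):
    """Hypothetical per-channel continuous mask (not the paper's method).

    enc_weights has shape (C, s, s): one linear patch filter per latent
    channel, standing in for a real encoder. Each channel's mask value is the
    share of that filter's absolute weight falling on visible pixels, so
    channels that attend to different pixels receive different values.
    """
    C, s, _ = enc_weights.shape
    T, H, W = pixel_mask.shape
    patches = pixel_mask.reshape(T, H // s, s, W // s, s)
    abs_w = np.abs(enc_weights)
    visible = np.einsum("cab,tiajb->ctij", abs_w, patches)
    total = abs_w.sum(axis=(1, 2))[:, None, None, None]
    return np.clip(visible / total, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, H, W, s, C = 2, 8, 8, 4, 3
    pixel_mask = np.ones((T, H, W))
    pixel_mask[:, :, 6:] = 0.0  # the two rightmost pixel columns are hidden
    weights = rng.standard_normal((C, s, s))
    print(naive_latent_mask(pixel_mask, s, C)[:, 0, 0, :])
    print(toy_continuous_latent_mask(pixel_mask, weights)[:, 0, 0, :])
```

In this toy setup the half-hidden latent cell collapses to a single binary value shared by every channel under the naive route, while the per-channel variant keeps a fractional, channel-specific value for the same cell.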