Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model

Yang Yang, Siming Zheng, Qirui Yang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang

TL;DR

This work tackles controllable, temporally coherent video bokeh for arbitrary input videos. It introduces Any-to-Bokeh, a one-step diffusion framework conditioned on a multi-plane image (MPI) adapted to the focal plane, trained with a three-stage progressive strategy and run with a weighted-overlap inference scheme to improve temporal stability. Key innovations include disparity-aware MPI sampling around the focal plane with $h_i=\left(\frac{i}{N}\right)^{\frac{1}{d_f}}$, geometry blocks driven by MPI attention, and conditioning on the disparity-difference map $V_D$ and blur strength $K$. Because $V_D$ and $K$ enter the diffusion conditioning as explicit control signals, focus placement and blur strength can be set directly across diverse scenes. Experiments on synthetic and real data show state-of-the-art temporal coherence and spatial fidelity, establishing a practical baseline for depth-aware video bokeh in real-world applications.
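The sampling rule $h_i=\left(\frac{i}{N}\right)^{\frac{1}{d_f}}$ can be evaluated numerically to see how the levels move as the focal disparity changes. The sketch below is only an illustration under stated assumptions: it takes $N$ to be the number of planes, $d_f$ to be a focal-plane disparity normalized to $(0, 1]$, and the index $i$ to run from $0$ to $N$. None of these conventions is confirmed by the summary above, and the function name `mpi_sampling_levels` is hypothetical.

```python
import numpy as np


def mpi_sampling_levels(num_planes: int, focal_disparity: float) -> np.ndarray:
    """Evaluate h_i = (i / N) ** (1 / d_f) for i = 0..N.

    Assumptions (not taken from the paper): N is the number of planes,
    d_f is a normalized focal disparity in (0, 1], and i includes both
    endpoints 0 and N.
    """
    if num_planes < 1:
        raise ValueError("num_planes must be at least 1")
    if not 0.0 < focal_disparity <= 1.0:
        raise ValueError("focal_disparity is assumed to lie in (0, 1]")
    indices = np.arange(num_planes + 1, dtype=np.float64)
    return (indices / num_planes) ** (1.0 / focal_disparity)


if __name__ == "__main__":
    # With d_f = 1 the exponent is 1, so the levels are uniformly spaced.
    print(mpi_sampling_levels(4, 1.0))
    # With d_f = 0.5 the exponent is 2, so the levels cluster toward 0.
    print(mpi_sampling_levels(4, 0.5))
```

Under these assumptions the levels always start at 0 and end at 1, and a smaller $d_f$ pushes the intermediate levels toward 0. How the levels are then turned into MPI masks is not described in this summary, so the sketch stops at the formula itself.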

Abstract

Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion for video bokeh remains unexplored. Existing image-based methods suffer from temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability to controllable video bokeh. In this work, we propose a one-step diffusion framework for rendering temporally coherent, depth-aware video bokeh. The framework employs a multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, thereby enabling it to exploit strong 3D priors from pretrained backbones. To further enhance temporal stability, depth robustness, and detail preservation, we introduce a progressive training strategy. Experiments on synthetic and real-world benchmarks demonstrate superior temporal coherence, spatial accuracy, and controllability, outperforming prior baselines. This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects.

Paper Structure

This paper contains 27 sections, 10 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Any-to-Bokeh enables users to customize the focal plane and adjust bokeh intensity. The yellow box indicates the focal plane, and the grayscale values in the image represent the distance to the focal plane, with higher values indicating closer proximity. $K$ represents the bokeh intensity.
  • Figure 2: Two key components of Any-to-Bokeh. (a) One-step video bokeh pipeline: takes any video and its disparity relative to the focal plane as input and renders the bokeh effect. (b) MPI spatial block: uses the MPI mask $\mathcal{M}$ to prompt the MPI spatial block, which guides bokeh rendering. The user-defined blur strength $K$ is injected through an embedding.
  • Figure 3: Progressive Training Strategy. Stage 1: Train the whole U-Net and the adapters. Stage 2: Refine the temporal block with the disturbance. Stage 3: Fine-tune the VAE decoder. We desaturated the colors in the same areas.
  • Figure 4: Qualitative Results on Real-World Video frames. To highlight the differences, we zoom in on the red and green regions. Red arrows indicate incorrectly focused areas.
  • Figure 5: Visualization of generated bokeh effects on DAVIS dataset. The yellow cross represents the focus subject. Please zoom in to see the details.
  • ...and 6 more figures