Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model
Yang Yang, Siming Zheng, Qirui Yang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang
TL;DR
This work tackles controllable, temporally coherent video bokeh for arbitrary inputs. It introduces Any-to-Bokeh, a one-step diffusion framework conditioned on a multi-plane image (MPI) adapted to the focal plane, augmented by a three-stage progressive training strategy and a weighted overlap inference scheme to ensure temporal stability. Key innovations include disparity-aware MPI sampling around the focal plane with $h_i=\left(\frac{i}{N}\right)^{\frac{1}{d_f}}$, MPI-attention-driven geometry blocks, and conditioning on the disparity-difference map $V_D$ and blur strength $K$, which act as explicit control signals for focus placement and blur strength across diverse scenes. Experiments on synthetic and real data show state-of-the-art temporal coherence, spatial fidelity, and explicit control over focus and blur, establishing a practical baseline for depth-aware video bokeh in real-world applications.
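To make the sampling formula above concrete, the following minimal sketch evaluates $h_i=\left(\frac{i}{N}\right)^{\frac{1}{d_f}}$ for every index. It is only an illustration under stated assumptions: the function name `mpi_sample_values` is hypothetical, $i$ is assumed to run from $0$ to $N$, and $d_f$ is assumed to be a focal disparity normalized to $(0, 1]$. How the resulting values are mapped to actual MPI plane positions is not specified here and is not covered by the sketch.

```python
def mpi_sample_values(num_planes: int, focal_disparity: float) -> list[float]:
    """Evaluate h_i = (i / N) ** (1 / d_f) for i = 0..N.

    Assumptions (not confirmed by the source): i runs from 0 to N
    inclusive, and d_f is a disparity normalized to (0, 1].
    """
    if num_planes < 1:
        raise ValueError("num_planes (N) must be at least 1")
    if not 0.0 < focal_disparity <= 1.0:
        raise ValueError("focal_disparity (d_f) is assumed to lie in (0, 1]")

    exponent = 1.0 / focal_disparity
    return [(i / num_planes) ** exponent for i in range(num_planes + 1)]


if __name__ == "__main__":
    # d_f = 1 gives a uniform spacing; smaller d_f raises the exponent.
    for d_f in (1.0, 0.5, 0.25):
        values = mpi_sample_values(4, d_f)
        print(d_f, [round(v, 4) for v in values])
```

Under these assumptions, $d_f = 1$ yields uniformly spaced values in $[0, 1]$, while a smaller $d_f$ raises the exponent and makes the spacing non-uniform, with the samples clustering toward $0$.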
Abstract
Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion for video bokeh remains unexplored. Existing image-based methods are plagued by temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability to controllable video bokeh. In this work, we propose a one-step diffusion framework for temporally coherent, depth-aware video bokeh rendering. The framework employs a multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, thereby enabling it to exploit strong 3D priors from pretrained backbones. To further enhance temporal stability, depth robustness, and detail preservation, we introduce a progressive training strategy. Experiments on synthetic and real-world benchmarks demonstrate superior temporal coherence, spatial accuracy, and controllability, outperforming prior baselines. This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects.
