DiffCamera: Arbitrary Refocusing on Images

Yiyang Wang, Xi Chen, Xiaogang Xu, Yu Liu, Hengshuang Zhao

TL;DR

Controlling depth of field (DoF) in an existing image is difficult with text prompts or conventional single-image edits. DiffCamera introduces a diffusion-transformer framework conditioned on an arbitrary focus point and blur level. It is trained on large-scale simulated DoF pairs and regularized by a stacking-based constraint and depth dropout to encourage physically consistent refocusing. The approach refocuses robustly across diverse scenes and depths, and a dedicated 150-scene benchmark with multiple evaluation metrics shows improved DoF manipulation and content fidelity. This enables flexible post-processing for photography and for generative AI systems that need controllable depth-of-field effects.

Abstract

The depth-of-field (DoF) effect, which introduces aesthetically pleasing blur, enhances photographic quality but is fixed and difficult to modify once the image has been created. This becomes problematic when the applied blur is undesirable (e.g., the subject is out of focus). To address this, we propose DiffCamera, a model that enables flexible refocusing of a created image conditioned on an arbitrary new focus point and a blur level. Specifically, we design a diffusion transformer framework for learning to refocus. However, training requires pairs of images of the same scene with different focus planes and bokeh levels, which are hard to acquire. To overcome this limitation, we develop a simulation-based pipeline to generate large-scale image pairs with varying focus planes and bokeh levels. With the simulated data, we find that training with only a vanilla diffusion objective often leads to incorrect DoF behaviors due to the complexity of the task, so a stronger constraint is needed during training. Inspired by the photographic principle that photos with different focus planes can be linearly blended into a multi-focus image, we propose a stacking constraint during training to enforce precise DoF manipulation. This constraint imposes physically grounded refocusing behavior: the refocused results must align faithfully with the scene structure and the camera conditions so that they can be combined into the correct multi-focus image. We also construct a benchmark to evaluate the effectiveness of our refocusing model. Extensive experiments demonstrate that DiffCamera supports stable refocusing across a wide range of scenes, providing unprecedented control over DoF adjustments for photography and generative AI applications.
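The stacking principle mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's loss; it assumes a two-layer scene, a box blur as a stand-in for lens blur, and a binary blending mask derived from depth. All function names (`box_blur`, `render_focused`, `stack`, `stacking_loss`) are hypothetical. It only shows that two correctly refocused images blend linearly back into the all-in-focus image, while an inconsistent pair does not.

```python
import numpy as np

def box_blur(img, radius):
    """Box blur by averaging shifted copies (wrap-around borders; toy only)."""
    if radius == 0:
        return img.copy()
    acc = np.zeros_like(img, dtype=np.float64)
    count = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            acc += np.roll(img, shift=(dy, dx), axis=(0, 1))
            count += 1
    return acc / count

def render_focused(sharp, depth, focus_depth, blur_radius):
    """Toy DoF render: pixels on the focus plane stay sharp, the rest are blurred."""
    blurred = box_blur(sharp, blur_radius)
    in_focus = np.isclose(depth, focus_depth)
    return np.where(in_focus, sharp, blurred)

def stack(img_a, img_b, mask_a):
    """Linear blend: take img_a where mask_a is 1 and img_b elsewhere."""
    return mask_a * img_a + (1.0 - mask_a) * img_b

def stacking_loss(pred_a, pred_b, mask_a, all_in_focus):
    """Mean squared error between the blended predictions and the sharp target."""
    return float(np.mean((stack(pred_a, pred_b, mask_a) - all_in_focus) ** 2))

# Assumed toy scene: left half is the near layer (depth 0), right half is far (depth 1).
rng = np.random.default_rng(0)
sharp = rng.random((8, 8))
depth = np.zeros((8, 8))
depth[:, 4:] = 1.0

near = render_focused(sharp, depth, focus_depth=0.0, blur_radius=2)
far = render_focused(sharp, depth, focus_depth=1.0, blur_radius=2)
mask_near = (depth == 0.0).astype(np.float64)

# A consistent pair stacks back to the sharp image; a wrong pair does not.
consistent = stacking_loss(near, far, mask_near, sharp)
inconsistent = stacking_loss(far, far, mask_near, sharp)
print(consistent, inconsistent)
```

In a training setting, `near` and `far` would be replaced by two model predictions for different focus conditions, so a nonzero stacking loss penalizes predictions whose sharp regions do not match the requested focus plane.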

Paper Structure

This paper contains 20 sections, 9 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Pipeline of DiffCamera. We convert the image and camera conditions into tokens using a VAE encoder or a learnable linear projection and feed them into a diffusion transformer, as shown on the left side. The right side visualizes the learning objective of the stacking constraint, where the two diffusion transformers share the same weights. The VAEs are all frozen and the diffusion transformer is trainable. The symbols are defined in \ref{formulation: 1}.
  • Figure 2: Qualitative comparisons on refocusing and adding bokeh. We perform refocusing on images exhibiting strong defocus blur, setting the blur level to zero and fixing the focus point at the image center.
  • Figure 3: Qualitative comparisons on bokeh removal (deblurring). We refocus images with defocus blur, setting the blur level to zero and fixing the focus point at the image center. We compare DiffCamera with the SOTA deblurring method Restormer and the image-editing ability of GPT-4o.
  • Figure 4: Qualitative ablation studies on the stacking constraint. Without the stacking constraint, we observe incorrect model behaviors in generating bokeh effects: in the first row, the model fails to bring the target stone area into focus; in the second row, the background is clear when bokeh=15, and most of the front part of the boat is blurry when bokeh=20, even though it lies in the focus plane. This shows that the stacking constraint enforces consistency with the DoF conditions.
  • Figure 5: Qualitative studies on depth dropout. We demonstrate that depth dropout enhances DiffCamera's robustness to inaccurate depth maps when simulating bokeh effects, outperforming the traditional bokeh simulator BokehMe and a variant of DiffCamera without depth dropout.
  • ...and 7 more figures
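
The Figure 1 caption says that conditions are turned into tokens with a VAE encoder or a learnable linear projection. As a rough illustration of the linear-projection branch, the sketch below assumes the camera condition is a normalized focus point plus a blur level, projected to a single token and appended to the image tokens. The class name, the `max_blur` scale, the token dimension, and the way the token is appended are all assumptions for illustration, not details taken from the paper.

```python
import numpy as np

class CameraConditionProjector:
    """Maps (focus_x, focus_y, blur_level) to one token via a linear layer."""

    def __init__(self, d_model, max_blur, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialised here; in training these weights would be learned.
        self.weight = rng.normal(0.0, 0.02, size=(3, d_model))
        self.bias = np.zeros(d_model)
        self.max_blur = float(max_blur)  # hypothetical normalisation constant

    def __call__(self, focus_xy, blur_level):
        # Focus point is assumed normalised to [0, 1]; blur is rescaled to [0, 1].
        cond = np.array([focus_xy[0], focus_xy[1], blur_level / self.max_blur])
        return cond @ self.weight + self.bias  # shape: (d_model,)

def build_sequence(image_tokens, camera_token):
    """Append the camera token to the image-token sequence."""
    return np.concatenate([image_tokens, camera_token[None, :]], axis=0)

projector = CameraConditionProjector(d_model=16, max_blur=20.0)
image_tokens = np.zeros((64, 16))  # stand-in for patchified VAE latents
camera_token = projector(focus_xy=(0.5, 0.5), blur_level=0.0)
sequence = build_sequence(image_tokens, camera_token)
print(sequence.shape)
```

Here the condition `focus_xy=(0.5, 0.5), blur_level=0.0` mirrors the setting described in the Figure 2 and Figure 3 captions (focus at the image center, blur level zero).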