Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections

Ankit Dhiman, Manan Shah, Rishubh Parihar, Yash Bhalgat, Lokesh R Boregowda, R Venkatesh Babu

TL;DR

The paper tackles generating faithful mirror reflections with diffusion models by framing it as an image inpainting task conditioned on depth. It introduces SynMirror, a large-scale synthetic dataset, and MirrorBench for evaluation, along with MirrorFusion, a depth-conditioned diffusion inpainting model that achieves geometrically consistent reflections. MirrorFusion outperforms zero-shot baselines and depth-agnostic inpainting approaches on reflective tasks, demonstrating robust 3D-aware generation and generalization to unseen objects. This work enables controllable, photo-realistic mirror reflections for image editing and augmented reality, while acknowledging limitations and societal considerations of diffusion-based generation.

Abstract

We tackle the problem of generating highly realistic and plausible mirror reflections using diffusion-based generative models. We formulate this problem as an image inpainting task, allowing for more user control over the placement of mirrors during the generation process. To enable this, we create SynMirror, a large-scale dataset of diverse synthetic scenes with objects placed in front of mirrors. SynMirror contains around 198k samples rendered from 66k unique 3D objects, along with their associated depth maps, normal maps and instance-wise segmentation masks, to capture relevant geometric properties of the scene. Using this dataset, we propose a novel depth-conditioned inpainting method called MirrorFusion, which generates high-quality, realistic, shape and appearance-aware reflections of real-world objects. MirrorFusion outperforms state-of-the-art methods on SynMirror, as demonstrated by extensive quantitative and qualitative analysis. To the best of our knowledge, we are the first to successfully tackle the challenging problem of generating controlled and faithful mirror reflections of an object in a scene using diffusion-based models. SynMirror and MirrorFusion open up new avenues for image editing and augmented reality applications for practitioners and researchers alike. The project page is available at: https://val.cds.iisc.ac.in/reflecting-reality.github.io/.

Paper Structure

This paper contains 34 sections, 3 equations, 19 figures, 3 tables, and 1 algorithm.

Figures (19)

  • Figure 1: We present MirrorFusion, a diffusion-based inpainting model that generates high-quality, geometrically consistent and photo-realistic mirror reflections given an input image and a mask depicting the mirror region. Our method produces higher-quality generations than previous state-of-the-art diffusion-based text-to-image and inpainting methods. All the images were generated by prefixing the mirror text prompt "A perfect plain mirror reflection of " to the input object description.
  • Figure 2: Images generated from Stable Diffusion 2.1 (Rombach et al., 2022). Text-to-image models, when prompted to generate reflections, struggle to generate consistent and controlled mirror reflections.
  • Figure 3: SynMirror: a) Dataset creation pipeline - We sample diverse 3D objects, mirrors as 2D planes and diverse floor textures to compose a scene in a Blender environment. To enhance realism, we sample high-quality HDRI environment maps as backgrounds. We sample cameras from varied viewpoints, capturing the mirror and the object, and use Blender to render RGB images and dense 2D annotations. b) Samples from SynMirror - The generated scenes have complex geometry, textures, and high diversity. The renderings have accurate dense annotations for semantic, depth and normal maps at the original image resolution.
  • Figure 4: Overview of the architecture. We encode the input image $x$ using the pre-trained image encoder from Stable Diffusion to get $z_m$. We then resize the mirror mask $m$ and depth map $d$ to obtain the resized mask $x_m$ and depth $x_d$. The noisy latents $z_t$, $z_m$, $x_m$ and $x_d$ are concatenated and fed into the Conditioning U-Net $\epsilon'_{\theta}$. Each layer of the Generation U-Net $\epsilon_{\theta}$ is conditioned via zero convolutions on the corresponding layer of $\epsilon'_{\theta}$. Additionally, $\epsilon_{\theta}$ is conditioned on text embeddings. The pre-trained decoder then decodes the denoised latent to produce an image with mirror reflections. Details are given in the paper's method section.
  • Figure 5: Impact of depth conditioning on the reflection generation quality. Notice the irregular shape of the "baseball" and "chair" marked in red. In comparison, our method preserves the structure of the object (marked in green).
  • ...and 14 more figures
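
The input assembly described in the Figure 4 caption (resize the mask and depth map to the latent resolution, then concatenate them with the noisy latents and the encoded image latents along the channel axis) can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the helper names (`resize_nearest`, `concat_channels`, `build_conditioning_input`), the use of nearest-neighbour resizing, the 4-channel latents at 1/8 resolution, and the single-channel mask and depth map are all assumptions made for the example.

```python
# Sketch only: helper names, shapes and resize method are illustrative assumptions,
# not taken from the MirrorFusion code base.

def resize_nearest(plane, out_h, out_w):
    """Nearest-neighbour resize of a 2D list (H x W) to out_h x out_w."""
    in_h, in_w = len(plane), len(plane[0])
    return [
        [plane[(i * in_h) // out_h][(j * in_w) // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def concat_channels(*tensors):
    """Concatenate channel-first tensors (lists of H x W planes) along channels."""
    h, w = len(tensors[0][0]), len(tensors[0][0][0])
    out = []
    for tensor in tensors:
        for plane in tensor:
            if len(plane) != h or len(plane[0]) != w:
                raise ValueError("spatial sizes must match before concatenation")
            out.append(plane)
    return out


def zeros(channels, h, w):
    """Placeholder tensor standing in for real latents."""
    return [[[0.0] * w for _ in range(h)] for _ in range(channels)]


def build_conditioning_input(z_t, z_m, mask, depth):
    """Stack z_t, z_m, resized mask x_m and resized depth x_d channel-wise."""
    h, w = len(z_t[0]), len(z_t[0][0])
    x_m = [resize_nearest(mask, h, w)]   # one channel (assumed)
    x_d = [resize_nearest(depth, h, w)]  # one channel (assumed)
    return concat_channels(z_t, z_m, x_m, x_d)


# Toy example: a 64x64 image with an assumed 8x8, 4-channel latent.
H = W = 64
LATENT = 8
z_t = zeros(4, LATENT, LATENT)  # noisy latents
z_m = zeros(4, LATENT, LATENT)  # encoded input image
mask = [[1.0 if j >= W // 2 else 0.0 for j in range(W)] for _ in range(H)]
depth = [[i / H for _ in range(W)] for i in range(H)]

cond_in = build_conditioning_input(z_t, z_m, mask, depth)
print(len(cond_in), len(cond_in[0]), len(cond_in[0][0]))
```

Under these assumptions the Conditioning U-Net would receive 4 + 4 + 1 + 1 = 10 input channels at latent resolution. The zero convolutions mentioned in the caption are not modelled here; the sketch covers only how the conditioning input is stacked.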