MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World

Ankit Dhiman, Manan Shah, R Venkatesh Babu

TL;DR

MirrorFusion 2.0 tackles the challenge of photorealistic mirror reflections by coupling a richly augmented synthetic dataset, SynMirrorV2, with a depth-conditioned, dual-branch diffusion inpainting model trained through a three-stage curriculum. The curriculum exposes the model first to single-object and then to multi-object scenes before bridging to real-world data, improving geometric fidelity, shading, and occlusion handling over prior diffusion methods. Quantitative metrics (PSNR, SSIM, LPIPS, CLIP similarity) and a user study show substantial gains on the MirrorBenchV2, GSO, and MSD datasets, indicating better generalization to real-world imagery. The method still produces artifacts in very cluttered scenes, motivating future work on further data augmentation and curriculum refinements to close the remaining gaps in realism and consistency.

Abstract

Diffusion models have become central to various image editing tasks, yet they often fail to fully adhere to physical laws, particularly with effects like shadows, reflections, and occlusions. In this work, we address the challenge of generating photorealistic mirror reflections using diffusion-based generative models. Despite extensive training data, existing diffusion models frequently overlook the nuanced details crucial to authentic mirror reflections. Recent approaches have attempted to resolve this by creating synthetic datasets and framing reflection generation as an inpainting task; however, they struggle to generalize across different object orientations and positions relative to the mirror. Our method overcomes these limitations by introducing key augmentations into the synthetic data pipeline: (1) random object positioning, (2) randomized rotations, and (3) grounding of objects, significantly enhancing generalization across poses and placements. To further address spatial relationships and occlusions in scenes with multiple objects, we implement a strategy to pair objects during dataset generation, resulting in a dataset robust enough to handle these complex scenarios. Achieving generalization to real-world scenes remains a challenge, so we introduce a three-stage training curriculum to develop the MirrorFusion 2.0 model to improve real-world performance. We provide extensive qualitative and quantitative evaluations to support our approach. The project page is available at: https://mirror-verse.github.io/.
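
The three augmentations named in the abstract can be illustrated with a small geometric sketch. This is not the authors' pipeline: the function name `augment_object`, the z-up convention, the restriction of rotation to yaw about the vertical axis, and the `floor_half_extent` sampling range are all assumptions made for illustration. It only shows how a point set might be randomly rotated, randomly placed on a floor plane, and then grounded so that it rests on z = 0.

```python
import numpy as np


def unit_cube():
    """Hypothetical test object: 8 corners of a unit cube floating above the floor."""
    return np.array(
        [[x, y, z] for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (0.3, 1.3)]
    )


def augment_object(vertices, rng, floor_half_extent=1.0):
    """Randomly rotate, place, and ground an object given as an (N, 3) point array.

    Assumes z is the up axis and the floor is the plane z = 0 (illustrative
    conventions, not taken from the paper).
    """
    # (2) Randomized rotation: here only yaw about the vertical axis (assumption).
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0],
                      [s, c, 0.0],
                      [0.0, 0.0, 1.0]])
    out = vertices @ rot_z.T

    # (1) Random positioning: translate within the floor plane only.
    offset_xy = rng.uniform(-floor_half_extent, floor_half_extent, size=2)
    out[:, :2] += offset_xy

    # (3) Grounding: shift vertically so the lowest point touches the floor.
    out[:, 2] -= out[:, 2].min()
    return out, theta, offset_xy
```

Calling `augment_object(unit_cube(), np.random.default_rng(0))` returns the transformed points together with the sampled yaw angle and floor offset. Because the rotation and translations are rigid, the object's shape is unchanged; only its pose and its contact with the floor differ between samples.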

Paper Structure

This paper contains 33 sections, 2 equations, 19 figures, 5 tables, 3 algorithms.

Figures (19)

  • Figure 1: Our model MirrorFusion 2.0, trained on our enhanced dataset SynMirrorV2, surpasses previous state-of-the-art diffusion-based inpainting models at the task of generating mirror reflections. All images were created by appending the prompt "A perfect plane mirror reflection of " to the object description. All text prompts can be found in the supplementary material.
  • Figure 2: We observe that current state-of-the-art T2I models, SD3.5 (top row) and Flux (bottom row), face significant challenges in producing consistent and geometrically accurate reflections when prompted to generate reflections in the scene.
  • Figure 3: Dataset Generation Pipeline. Our dataset generation pipeline introduces key augmentations such as random positioning, rotation, and grounding of objects within the scene using the 3D-Positioner. Additionally, we pair objects in semantically consistent combinations to simulate complex spatial relationships and occlusions, capturing realistic interactions for multi-object scenes.
  • Figure 4: Comparison on MirrorBenchV2. The baseline fails to maintain accurate reflections and spatial consistency, showing (a) incorrect chair orientation and (b) distorted reflections of multiple objects. In contrast, our method correctly renders (a) the chair and (b) the sofas with accurate position, orientation, and structure, demonstrating superior performance.
  • Figure 5: Comparison on the GSO dataset. In (a), the baseline method misrepresents object structure, while our method preserves spatial integrity and produces realistic reflections. In (b), the baseline yields incomplete and distorted reflections of the mug, whereas our approach generates accurate geometry, color, and detail, showing superior performance on out-of-distribution objects.
  • ...and 14 more figures