An Empty Room is All We Want: Automatic Defurnishing of Indoor Panoramas

Mira Slavcheva, Dave Gausebeck, Kevin Chen, David Buchhofer, Azwad Sabik, Chen Ma, Sachal Dhillon, Olaf Brandt, Alan Dolhasz

TL;DR

This work tackles automatic defurnishing of indoor panoramas with a domain-tuned inpainting pipeline built on Stable Diffusion. It operates on full equirectangular panoramas for the extra context they provide, localizes furniture via semantic segmentation, and inpaints with a model fine-tuned using synthetic shadows and a diverse set of prompts, which mitigates hallucinations without relying on room layout estimation. Targeted pre-processing and a post-processing blend preserve high-frequency detail, yielding crisper textures and lower perceptual distortion than competing methods, as shown by quantitative metrics and qualitative examples. The approach supports practical digital twin workflows by enabling consistent, high-fidelity defurnishing for real estate visualization, interior design exploration, and renovation planning.

Abstract

We propose a pipeline that leverages Stable Diffusion to improve inpainting results in the context of defurnishing -- the removal of furniture items from indoor panorama images. Specifically, we illustrate how increased context, domain-specific model fine-tuning, and improved image blending can produce high-fidelity inpaints that are geometrically plausible without needing to rely on room layout estimation. We demonstrate qualitative and quantitative improvements over other furniture removal techniques.

Paper Structure (24 sections, 1 equation, 16 figures, 1 table)

This paper contains 24 sections, 1 equation, 16 figures, and 1 table.

Figures (16)

  • Figure 1: Indoor panorama defurnishing. Our custom fine-tuning of Stable Diffusion inpainting reduces its tendency to hallucinate objects near shadows and reflections, such as the radiators on the walls and lamp in the corner.
  • Figure 2: Defurnishing pipeline. The input to our system is an 8192$\times$4096 pixel equirectangular panorama, which we crop to a 3:1 aspect ratio to exclude the poles. Pre-processing: We obtain a binary furniture mask via semantic segmentation. Both input and mask are rolled, so that inpainting regions are in the center of the image, and padded, to ensure sufficient context (note the repeated doorway and cupboard in the example). The images are then downsampled to a height of 512 pixels. Inpainting: Our custom process is robust to inexact masks and remnant shadows, as detailed in the method section. Post-processing: We apply 4$\times$ superresolution to the inpainted output and then blend the original and inpainted panoramas into the final result using the mask, keeping as much of the original resolution detail as possible.
  • Figure 3: Training dataset examples. Synthetic furniture items and shadows are rendered over real unfurnished panoramas.
  • Figure 4: Effect of number of training prompts. Fewer than 32 prompts lead to hallucinations near the shadow on the right wall.
  • Figure 5: Initialization of latents for inpainting. With decreasing percentage of noise in the initialization, hallucinations in the hallway decrease, but blurriness on the inpainted floor increases.
  • ...and 11 more figures
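
The pre- and post-processing steps named in the Figure 2 caption (crop to 3:1, roll, pad, blend with the mask) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the circular-mean rule for choosing the roll offset, the wrap-around padding width, and the plain per-pixel alpha blend are all assumptions made for illustration. Downsampling to a height of 512 pixels, the inpainting model itself, and super-resolution are omitted.

```python
import numpy as np


def crop_to_3_1(pano):
    """Trim equal bands at top and bottom so that width:height becomes 3:1."""
    h, w = pano.shape[:2]
    new_h = w // 3
    top = (h - new_h) // 2
    return pano[top:top + new_h]


def center_shift(mask):
    """Horizontal shift that moves the masked region to the image center.

    Assumption: the offset is chosen from a circular mean of the masked
    columns, because the left and right edges of a panorama are adjacent.
    """
    w = mask.shape[1]
    weights = mask.sum(axis=0).astype(float)  # masked pixels per column
    angles = 2.0 * np.pi * np.arange(w) / w
    mean_angle = np.arctan2((weights * np.sin(angles)).sum(),
                            (weights * np.cos(angles)).sum())
    center_col = int(np.rint(mean_angle / (2.0 * np.pi) * w)) % w
    return w // 2 - center_col


def roll_and_pad(img, shift, pad):
    """Roll horizontally by `shift`, then wrap-pad `pad` columns per side."""
    rolled = np.roll(img, shift, axis=1)
    widths = [(0, 0), (pad, pad)] + [(0, 0)] * (img.ndim - 2)
    # mode="wrap" repeats content from the opposite edge, which is why
    # objects can appear twice in the padded image.
    return np.pad(rolled, widths, mode="wrap")


def unpad_and_unroll(img, shift, pad):
    """Inverse of roll_and_pad: drop the padding, then roll back."""
    core = img[:, pad:img.shape[1] - pad]
    return np.roll(core, -shift, axis=1)


def blend(original, inpainted, mask):
    """Keep original pixels outside the mask and inpainted pixels inside.

    Assumption: a hard per-pixel alpha blend stands in for the paper's
    blending step, whose details are not given in this excerpt.
    """
    alpha = mask.astype(float)
    if original.ndim == 3:
        alpha = alpha[..., None]  # broadcast over color channels
    return alpha * inpainted + (1.0 - alpha) * original
```

In use, the same `shift` and `pad` would be applied to both the panorama and its mask before inpainting, and `unpad_and_unroll` would restore the original framing before `blend` combines the result with the full-resolution input.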