RoomEditor++: A Parameter-Sharing Diffusion Architecture for High-Fidelity Furniture Synthesis

Qilong Wang, Xiaofan Ming, Zhenyi Lin, Jinwen Li, Dongwei Ren, Wangmeng Zuo, Qinghua Hu

TL;DR

This work addresses the lack of public benchmarks and feature misalignment in furniture synthesis by releasing RoomBench++, a large open dataset with realistic-scene and real-scene subsets, and introducing RoomEditor++, a parameter-sharing diffusion architecture that unifies reference and background processing. The shared-backbone design improves feature consistency and enables precise geometric and textural integration across U-Net and DiT backbones, achieving state-of-the-art results on RoomBench++ and strong generalization to unseen indoor scenes and related domains. Comprehensive experiments, including quantitative metrics, human studies, and cross-dataset evaluations (3D-FUTURE and DreamBooth), validate the method’s superiority and robustness, with ablations underscoring the value of dataset scale and architectural sharing. The work advances practical furniture synthesis for home design and e-commerce by providing an open benchmark and a scalable, generalizable diffusion-based solution.

Abstract

Virtual furniture synthesis, which seamlessly integrates reference objects into indoor scenes while maintaining geometric coherence and visual realism, holds substantial promise for home design and e-commerce applications. However, this field remains underexplored because reproducible benchmarks are scarce and existing image composition methods struggle to achieve high-fidelity furniture synthesis while preserving background integrity. To overcome these challenges, we first present RoomBench++, a comprehensive and publicly available benchmark dataset tailored for this task. It consists of 112,851 training pairs and 1,832 testing pairs drawn from both real-world indoor videos and realistic home design renderings, thereby supporting robust training and evaluation under practical conditions. Then, we propose RoomEditor++, a versatile diffusion-based architecture featuring a parameter-sharing dual diffusion backbone, which is compatible with both U-Net and DiT architectures. This design unifies the feature extraction and inpainting processes for reference and background images. Our in-depth analysis reveals that the parameter-sharing mechanism enforces aligned feature representations, facilitating precise geometric transformations, texture preservation, and seamless integration. Extensive experiments show that RoomEditor++ outperforms state-of-the-art approaches in quantitative metrics, qualitative assessments, and human preference studies, and that it generalizes strongly to unseen indoor scenes and general scenes without task-specific fine-tuning. The dataset and source code are available at https://github.com/stonecutter-21/roomeditor.

Paper Structure

This paper contains 34 sections, 22 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Furniture synthesis with our RoomEditor++ integrates reference objects into environments with geometric coherence and visual fidelity. Moreover, RoomEditor++ exhibits remarkable generalization capabilities across a wide range of unseen scenes and objects without task-specific fine-tuning.
  • Figure 2: The annotations for RoomBench++. For the realistic-scene data, the labels are manually annotated. In contrast, for the real-scene data, the annotations are produced using the multimodal segmentation model Sa2Va [sa2va].
  • Figure 3: Overview of the construction of the realistic-scene subset: (a) data construction and (b) furniture categories. As shown in (a), after the images are classified as either product or background images, GPT-4o [achiam2023gpt] is used to assist with data filtering. (b) shows the category statistics of the realistic-scene subset.
  • Figure 4: Overview of the construction of the real-scene subset: (a) data construction and (b) furniture categories. In (a), frames are extracted from the video data and then processed with multimodal large models and traditional machine learning methods, yielding a nearly fully automated pipeline for building the final real-scene dataset. (b) shows the category statistics of the real-scene subset.
  • Figure 5: The architecture of our RoomEditor++. Our method shares parameters between the two diffusion backbones for unified feature space learning. As shown, reference features propagate independently, while background features interact with reference features through a self-attention module at each layer, ensuring effective feature alignment.
  • ...and 5 more figures
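
The information flow described in the Figure 5 caption can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions, not the authors' implementation: `SharedLayer`, `dual_forward`, the single-head attention, the residual connection, and the choice to let background tokens attend over the reference tokens entering each layer are all hypothetical stand-ins for the real U-Net or DiT blocks. What it does show is the two properties named in the caption: one set of weights serves both branches, and reference features never see the background while background features attend over both.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x d) matrix by a (d x m) matrix, both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, values)


class SharedLayer:
    """Hypothetical toy layer; one instance (one set of weights) serves both branches."""

    def __init__(self, dim, rng):
        def init():
            return [[rng.gauss(0.0, dim ** -0.5) for _ in range(dim)] for _ in range(dim)]

        self.wq, self.wk, self.wv = init(), init(), init()

    def forward(self, tokens, context=None):
        # Keys and values come from the tokens themselves, plus optional context tokens.
        source = tokens if context is None else tokens + context
        q = matmul(tokens, self.wq)
        k = matmul(source, self.wk)
        v = matmul(source, self.wv)
        out = attention(q, k, v)
        # Residual connection (an assumption of this sketch).
        return [[t + o for t, o in zip(t_row, o_row)] for t_row, o_row in zip(tokens, out)]


def dual_forward(layers, ref_tokens, bg_tokens):
    """Run both branches through the same layers, layer by layer."""
    for layer in layers:
        next_ref = layer.forward(ref_tokens)  # reference branch: self-attention only
        bg_tokens = layer.forward(bg_tokens, context=ref_tokens)  # background attends to bg + ref
        ref_tokens = next_ref
    return ref_tokens, bg_tokens


rng = random.Random(0)
dim = 4
layers = [SharedLayer(dim, rng) for _ in range(2)]
ref = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]  # 3 reference tokens
bg = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(5)]   # 5 background tokens
ref_out, bg_out = dual_forward(layers, ref, bg)
```

Because both branches call the same `SharedLayer` instances, reference and background tokens are projected by identical query, key, and value matrices, which is the sense in which the sketch places them in one feature space. Changing `bg` leaves `ref_out` untouched, while changing `ref` alters `bg_out`.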

Theorems & Definitions (1)

  • Remark 1