RORem: Training a Robust Object Remover with Human-in-the-Loop

Ruibin Li, Tao Yang, Song Guo, Lei Zhang

TL;DR

RORem tackles unreliable object removal by combining a semi-supervised, human-in-the-loop data pipeline with diffusion-based inpainting. It starts from 60K training triplets, iteratively expands the dataset through human and automated annotation to over 200K high-quality triplets, and fine-tunes SDXL-inpainting on them to obtain a robust remover. A discriminator trained on human labels automates data curation, and distillation reduces inference to four diffusion steps (~0.5s). Empirical results show state-of-the-art removal reliability and image quality, with the human-perceived success rate improving by more than 18% over prior methods. The authors acknowledge limitations on complex backgrounds and point to future work with more advanced foundation models.

Abstract

Despite significant advances, existing object removal methods struggle with incomplete removal, incorrect content synthesis and blurry synthesized regions, resulting in low success rates. These issues are mainly caused by the lack of high-quality paired training data, as well as the self-supervised training paradigm adopted in these methods, which forces the model to inpaint the masked regions and thus creates ambiguity between synthesizing the masked objects and restoring the background. To address these issues, we propose a semi-supervised learning strategy with human-in-the-loop to create high-quality paired training data, aiming to train a Robust Object Remover (RORem). We first collect 60K training pairs from open-source datasets to train an initial object removal model for generating removal samples, and then use human feedback to select a set of high-quality object removal pairs, with which we train a discriminator to automate the subsequent training data generation. By iterating this process for several rounds, we obtain a substantial object removal dataset with over 200K pairs. Fine-tuning the pre-trained Stable Diffusion model on this dataset yields RORem, which demonstrates state-of-the-art object removal performance in terms of both reliability and image quality. In particular, RORem improves the object removal success rate over previous methods by more than 18%. The dataset, source code and trained model are available at https://github.com/leeruibin/RORem.

Paper Structure (18 sections, 4 equations, 14 figures, 4 tables)

This paper contains 18 sections, 4 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Given an input image and a mask (see (a)), existing object removal methods such as PowerPaint [zhuang2023task] may inpaint the masked regions with other objects (see (b)), while our method can successfully remove the masked objects (see (c)).
  • Figure 2: Overview of our training data generation and model training process. In stage 1, we gather 60K training triplets from open-source datasets to train an initial removal model. In stage 2, we apply the trained model to a test set and engage human annotators to select high-quality samples to augment the training set. In stage 3, we train a discriminator using the human feedback data, and employ it to automatically annotate high-quality training samples. We iterate stages 2 and 3 for several rounds, ultimately obtaining over 200K object removal training triplets as well as the trained model.
  • Figure 3: We fine-tune the pre-trained SDXL-inpainting model with the standard diffusion training loss. We concatenate the triplet data together as the model inputs. The same training paradigm is employed across all three stages.
  • Figure 4: Training of the discriminator for automated data annotation. We use the down and middle blocks of the SDXL-inpainting model as the base model, introduce trainable LoRA layers into it, and add several convolutional layers after them. Human feedback data are used to train the LoRA and convolutional layers.
  • Figure 5: Efficient model distillation. We integrate trainable LoRA layers into the trained RORem model, and fine-tune it by adapting the latent consistency model (LCM) pipeline under the guidance of the original RORem. The distilled model can perform high-quality object removal in four diffusion steps.
  • ...and 9 more figures
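
The three-stage loop described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the names `Triplet`, `build_dataset`, `train_remover`, `human_accepts`, `train_discriminator`, `human_budget` and `threshold` are all hypothetical, images are stood in for by strings, and details such as how many samples humans label per round or how rejected samples are handled are guesses made to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Triplet:
    """One sample: source image, object mask, and object-free result."""
    image: str
    mask: str
    removed: str


def build_dataset(
    seed: Sequence[Triplet],
    test_pairs: Sequence[Tuple[str, str]],
    train_remover: Callable,        # hypothetical: triplets -> remover(image, mask)
    human_accepts: Callable,        # hypothetical: Triplet -> bool (human judgment)
    train_discriminator: Callable,  # hypothetical: labeled triplets -> score(Triplet)
    rounds: int,
    human_budget: int,              # assumption: humans label a fixed number per round
    threshold: float = 0.5,         # assumption: acceptance cut-off for the discriminator
):
    dataset: List[Triplet] = list(seed)
    remover = train_remover(dataset)  # stage 1: initial model from the seed triplets
    labeled: List[Tuple[Triplet, bool]] = []
    pool = list(test_pairs)

    for _ in range(rounds):
        # Run the current remover on the remaining (image, mask) pairs.
        candidates = [Triplet(img, m, remover(img, m)) for img, m in pool]

        # Stage 2: humans judge a limited subset and keep the good results.
        human_batch = candidates[:human_budget]
        rest = candidates[human_budget:]
        new_labels = [(t, human_accepts(t)) for t in human_batch]
        labeled.extend(new_labels)
        accepted = [t for t, ok in new_labels if ok]

        # Stage 3: a discriminator trained on all human labels so far
        # annotates the remaining candidates automatically.
        score = train_discriminator(labeled)
        accepted += [t for t in rest if score(t) >= threshold]

        # Grow the training set, retry rejected pairs next round, retrain.
        dataset.extend(accepted)
        done = {(t.image, t.mask) for t in accepted}
        pool = [(img, m) for img, m in pool if (img, m) not in done]
        remover = train_remover(dataset)

    return dataset, remover
```

In this sketch, pairs whose results are rejected stay in the pool and are retried with the retrained remover in the next round, so the accepted set can keep growing as the model improves. Whether the original pipeline reuses rejected pairs in this way is an assumption.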