ROSE: Remove Objects with Side Effects in Videos

Chenxuan Miao, Yutong Feng, Jianshu Zeng, Zixiang Gao, Hantang Liu, Yunfeng Yan, Donglian Qi, Xi Chen, Bin Wang, Hengshuang Zhao

TL;DR

ROSE tackles the challenge of removing objects from video while also eliminating side effects such as shadows, reflections, and lighting changes, addressing the scarcity of real paired data. It builds a fully automatic data pipeline using a 3D rendering engine to generate large synthetic video pairs and introduces ROSE-Bench for comprehensive evaluation of object–environment interactions. The method employs a diffusion-transformer video inpainting backbone with reference-based erasing, mask augmentation, and an explicit difference-mask predictor to localize affected regions, achieving state-of-the-art results and strong real-world generalization. These contributions advance video editing by enabling realistic object removal with environmental consistency and provide a standardized benchmark for evaluating complex side effects.

Abstract

Video object removal has achieved advanced performance due to the recent success of video generative models. However, existing works struggle to eliminate the side effects of objects, e.g., their shadows and reflections, because paired video data for supervision is scarce. This paper presents ROSE (Remove Objects with Side Effects), a framework that systematically studies an object's effects on its environment, which can be categorized into five common cases: shadows, reflections, light, translucency, and mirror. Given the challenges of curating paired videos exhibiting these effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate model performance on the removal of various side effects, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. The project page is https://rose2025-inpaint.github.io/.
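The abstract notes that the areas affected by side effects can be revealed through the differential mask between the paired videos. The following is a minimal sketch of that idea, not the authors' implementation: the function name `difference_mask`, the channel-averaged absolute difference, and the threshold value are all assumptions made for illustration.

```python
import numpy as np


def difference_mask(video_with_object, video_without_object, threshold=0.05):
    """Return a boolean (T, H, W) mask of pixels that differ between a pair.

    Both inputs are float arrays of shape (T, H, W, C) with values in [0, 1].
    The threshold is a hypothetical value, not one reported in the paper.
    """
    if video_with_object.shape != video_without_object.shape:
        raise ValueError("paired videos must have identical shapes")
    # Absolute per-pixel difference, averaged over the colour channels.
    diff = np.abs(video_with_object - video_without_object).mean(axis=-1)
    return diff > threshold


# Toy pair: a uniform grey background, plus an "object" and its "shadow".
T, H, W = 2, 4, 4
clean = np.full((T, H, W, 3), 0.5)
edited = clean.copy()
edited[:, 1:3, 1:3, :] = 0.9  # bright object occupying a 2x2 block
edited[:, 3, 1:3, :] = 0.3    # darker shadow just below the object

mask = difference_mask(edited, clean)
```

In this toy pair the mask covers the shadow pixels as well as the object itself, which is the property that makes a difference mask a natural supervision signal for side-effect regions.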

Paper Structure

This paper contains 23 sections, 3 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Video object removal results generated by ROSE (zoom in for a better view). Each pair of rows is one example: the top row shows the input video with its mask, and the bottom row shows the inference result. The cases are ordered by the side effects studied in this paper.
  • Figure 2: Paired video preparation pipeline using 3D data, which can be divided into: scene and object sampling, multi-view generation with masks, valid view filtering and video data rendering.
  • Figure 3: Illustration of the various side-effect categories studied in the dataset of ROSE.
  • Figure 4: The framework of ROSE. We concatenate the noisy latents with the original input video and masks, and the result is consumed by a video inpainting model. An additional difference mask predictor is introduced to predict the object-correlated areas in the video; the difference mask is computed automatically from the input video pairs.
  • Figure 5: Visualization of various mask augmentation strategies adopted in training.
  • ...and 2 more figures
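
The Figure 4 caption describes concatenating the noisy latents with the original input video and masks before they reach the inpainting model. A rough sketch of that conditioning step is shown below. It assumes a channel-first `(C, T, H, W)` layout, concatenation along the channel axis, and illustrative channel counts; the helper name `build_model_input` is hypothetical, and none of these details are confirmed by the text above.

```python
import numpy as np


def build_model_input(noisy_latents, video_latents, mask):
    """Concatenate the conditioning signals along the channel axis.

    All arrays are assumed to be laid out as (C, T, H, W) and to share the
    same (T, H, W) dimensions; only the channel counts may differ.
    """
    if not (noisy_latents.shape[1:] == video_latents.shape[1:] == mask.shape[1:]):
        raise ValueError("inputs must share their (T, H, W) dimensions")
    return np.concatenate([noisy_latents, video_latents, mask], axis=0)


rng = np.random.default_rng(0)
noisy = rng.standard_normal((16, 4, 8, 8))  # assumed 16 latent channels
video = rng.standard_normal((16, 4, 8, 8))  # full, unmasked input video
mask = np.zeros((1, 4, 8, 8))
mask[:, :, 2:6, 2:6] = 1.0                  # region of the object to erase

model_input = build_model_input(noisy, video, mask)
```

Passing the unmasked video alongside the mask matches the reference-based erasing described in the abstract, where the entire video is fed into the model so that object-correlated areas outside the mask remain visible to it.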