MObI: Multimodal Object Inpainting Using Diffusion Models

Alexandru Buburuzan; Anuj Sharma; John Redford; Puneet K. Dokania; Romain Mueller

TL;DR

MObI is a diffusion-based framework for multimodal object inpainting that jointly edits camera and lidar data, conditioned on a single reference image and a 3D bounding box. By extending Paint-by-Example with 3D box conditioning, modality-specific encoders, and gated cross-modal attention, it achieves realistic, semantically coherent insertions and replacements across modalities. The method demonstrates strong controllability and multimodal consistency, validated through qualitative results, realism metrics for camera and lidar, and downstream object-detection assessments on reinserted objects. Limitations include open-world generalization and potential background edits when conditioning on a single box, pointing to future work on full-scene conditioning and broader datasets. Overall, MObI offers a practical tool for generating realistic multimodal counterfactuals to stress-test perception systems in autonomous driving.

Abstract

Safety-critical applications, such as autonomous driving, require extensive multimodal data for rigorous testing. Methods based on synthetic data are gaining prominence due to the cost and complexity of gathering real-world data but require a high degree of realism and controllability in order to be useful. This paper introduces MObI, a novel framework for Multimodal Object Inpainting that leverages a diffusion model to create realistic and controllable object inpaintings across perceptual modalities, demonstrated for both camera and lidar simultaneously. Using a single reference RGB image, MObI enables objects to be seamlessly inserted into existing multimodal scenes at a 3D location specified by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, our 3D bounding box conditioning gives objects accurate spatial positioning and realistic scaling. As a result, our approach can be used to insert novel objects flexibly into multimodal scenes, providing significant advantages for testing perception models.
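
To make concrete why a 3D bounding box constrains both the position and the apparent scale of an inserted object, the sketch below projects the eight corners of a box into a pinhole camera and returns the enclosing pixel rectangle. It is an illustration only, not MObI's conditioning mechanism: the function names, the camera-frame axis convention (x right, y down, z forward), and the intrinsics `fx`, `fy`, `u0`, `v0` are all assumptions made for this example.

```python
import math


def box_corners(center, size, yaw):
    """Return the 8 corners of a 3D box in the camera frame.

    Assumed convention (hypothetical, not taken from the paper):
    x right, y down, z forward; size = (width along x, height along y,
    length along z); yaw is a rotation about the vertical y axis.
    """
    cx, cy, cz = center
    w, h, l = size
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-w / 2, w / 2):
        for dy in (-h / 2, h / 2):
            for dz in (-l / 2, l / 2):
                # Rotate the corner offset about y, then translate to the center.
                x = cx + c * dx + s * dz
                z = cz - s * dx + c * dz
                corners.append((x, cy + dy, z))
    return corners


def project(point, fx, fy, u0, v0):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + u0, fy * y / z + v0)


def box_to_image_rect(center, size, yaw, fx, fy, u0, v0):
    """Axis-aligned pixel rectangle (u_min, v_min, u_max, v_max) enclosing the box."""
    pts = [project(p, fx, fy, u0, v0) for p in box_corners(center, size, yaw)]
    us = [p[0] for p in pts]
    vs = [p[1] for p in pts]
    return (min(us), min(vs), max(us), max(vs))


if __name__ == "__main__":
    # The same box placed twice as far away covers a smaller image region.
    near = box_to_image_rect((0, 0, 10), (2, 2, 4), 0.0, 1000, 1000, 640, 360)
    far = box_to_image_rect((0, 0, 20), (2, 2, 4), 0.0, 1000, 1000, 640, 360)
    print(near)
    print(far)
```

Moving the box center farther along z shrinks the projected rectangle, and changing the yaw changes its extent. An edit mask alone does not carry this depth and orientation information, which is the gap that box conditioning is meant to close.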

Paper Structure (47 sections, 10 equations, 15 figures, 3 tables)

Introduction
Method
Multimodal encoding
Image encoding
Reference encoding and extraction
Lidar encoding
Multimodal generation
Inference and compositing
Training details
Sample selection
Reference selection
Augmentation
Fine-tuning procedure
Experiments
Object insertion and replacement
...and 32 more sections

Figures (15)

Figure 1: Our method can inpaint objects with a high degree of realism and controllability. Left: object inpainting methods based on edit masks alone, such as Paint-by-Example (PbE; Yang et al., 2023), achieve high realism but can lead to surprising results because there are often multiple semantically consistent ways to inpaint an object within a scene. Right: methods based on 3D reconstruction, such as NeuRAD (Tonderski et al., 2024), have strong controllability but sometimes lead to low realism, especially for object viewpoints that have not been observed. Our method achieves both high semantic consistency and controllability of the generation.
Figure 2: MObI architecture and training procedure.
Figure 3: Examples of object inpainting using MObI in the following settings: replacement (rows 1–4), insertion (row 5), and deletion (row 6, using a black reference). Our model can inpaint objects corresponding to a 3D bounding box with a high degree of realism while preserving coherence with the rest of the scene. Note that even though some references are from a different domain (time of day, weather condition), the model is able to preserve coherence of the resulting insertion.
Figure 4: Our method can generate multiple novel views from a single reference image while maintaining multimodal consistency. From left to right: reference image $\mathbf{x}_{\text{ref}}$, extracted from a separate scene; original destination scene with the RGB image $\mathbf{x}^{\text{(C)}}$ and lidar range depth $\mathbf{x}_0^{\text{(R)}}$; and edited scenes. Note that the inpainted pedestrian moves to the right between frames, shifting the background to the left. See the supplementary rotation results figure for extended results, including intensity.
Figure 5: Spatial compositing of camera-lidar object inpainting. Note that some background points are not overridden due to lidar reflections on the hood of the inserted car (bottom).
...and 10 more figures
