GenMM: Geometrically and Temporally Consistent Multimodal Data Generation for Video and LiDAR
Bharat Singh, Viveka Kulharia, Luyu Yang, Avinash Ravichandran, Ambrish Tyagi, Ashish Shrivastava
TL;DR
GenMM tackles the scarcity of coherent RGB-LiDAR synthetic data by enabling temporally and geometrically consistent insertion of 3D objects into videos and their corresponding LiDAR scans. The approach uses a reference image and 3D bounding boxes to drive three stages: diffusion-based video inpainting, semantic boundary and depth estimation, and a geometry-based LiDAR surface optimization that constrains the inserted object to the target bounding box while updating LiDAR rays for depth consistency. The method is validated on animating, swapping, and inserting objects, using video metrics (LPIPS, SSIM, FVD) and LiDAR point-wise and 3D reconstruction metrics, and shows improvements over baselines such as LoRA and AnimateDiff-CLIP. The work advances realistic multimodal data generation for autonomous driving, robotics, and AR/VR by maintaining high-fidelity appearance and geometry across modalities, which supports more robust downstream perception and planning models.
Abstract
Multimodal synthetic data generation is crucial in domains such as autonomous driving, robotics, augmented/virtual reality, and retail. We propose a novel approach, GenMM, for jointly editing RGB videos and LiDAR scans by inserting temporally and geometrically consistent 3D objects. Our method uses a reference image and 3D bounding boxes to seamlessly insert and blend new objects into target videos. We inpaint the 2D Regions of Interest (consistent with 3D boxes) using a diffusion-based video inpainting model. We then compute semantic boundaries of the object and estimate its surface depth using state-of-the-art semantic segmentation and monocular depth estimation techniques. Subsequently, we employ a geometry-based optimization algorithm to recover the 3D shape of the object's surface, ensuring it fits precisely within the 3D bounding box. Finally, LiDAR rays intersecting with the new object surface are updated to reflect depths consistent with its geometry. Our experiments demonstrate the effectiveness of GenMM in inserting various 3D objects across video and LiDAR modalities.
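The final step, updating LiDAR rays that hit the inserted object, can be illustrated with a minimal sketch. This is not the GenMM implementation; it only shows the general idea under strong simplifying assumptions: the LiDAR points are already expressed in the camera frame, the LiDAR and camera origins coincide, the camera is a pinhole with intrinsics K, and a per-pixel object mask and metric surface depth are already available. The function name update_lidar_points and all of its parameters are hypothetical.

```python
import numpy as np


def update_lidar_points(points_cam, K, mask, surface_depth):
    """Move LiDAR points that fall on the object mask onto the object surface.

    Assumptions (illustrative only, not taken from the paper):
      points_cam    : (N, 3) LiDAR points in the camera frame, z pointing forward
      K             : (3, 3) pinhole intrinsics
      mask          : (H, W) boolean mask of the inserted object
      surface_depth : (H, W) metric z-depth of the object surface per pixel
    Returns the updated points and a boolean array marking which rays changed.
    """
    points_cam = np.asarray(points_cam, dtype=float)
    updated = points_cam.copy()
    h, w = mask.shape

    z = points_cam[:, 2]
    in_front = z > 0
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by non-positive depth

    # Pinhole projection of each point to the nearest pixel.
    u = np.rint(K[0, 0] * points_cam[:, 0] / safe_z + K[0, 2]).astype(int)
    v = np.rint(K[1, 1] * points_cam[:, 1] / safe_z + K[1, 2]).astype(int)
    in_image = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # A ray "hits" the object if its pixel lies inside the mask.
    hit = np.zeros(len(points_cam), dtype=bool)
    hit[in_image] = mask[v[in_image], u[in_image]]

    # Slide each hit point along its ray so its z equals the surface depth.
    scale = surface_depth[v[hit], u[hit]] / z[hit]
    updated[hit] = points_cam[hit] * scale[:, None]
    return updated, hit
```

Because each hit point is rescaled along the ray through the shared origin, its direction is preserved and only its range changes. A real sensor setup would additionally need the LiDAR-to-camera extrinsics and a proper occlusion test.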
