GenMM: Geometrically and Temporally Consistent Multimodal Data Generation for Video and LiDAR

Bharat Singh, Viveka Kulharia, Luyu Yang, Avinash Ravichandran, Ambrish Tyagi, Ashish Shrivastava

TL;DR

GenMM tackles the scarcity of coherent RGB-LiDAR synthetic data by enabling temporally and geometrically consistent insertion of 3D objects into videos and their corresponding LiDAR scans. The approach integrates a reference image and 3D bounding boxes to drive diffusion-based video inpainting, semantic boundary and depth estimation, and a geometry-based LiDAR surface optimization that constrains the inserted object to the target bounding box while updating LiDAR rays for depth consistency. The method is validated on animating, swapping, and inserting objects with objective video metrics (LPIPS, SSIM, FVD) and LiDAR point-wise/3D reconstruction metrics, showing clear improvements over baselines like LoRA and AnimateDiff-CLIP. The work advances realistic multimodal data generation for autonomous driving, robotics, and AR/VR by maintaining high-fidelity appearance and geometry across modalities, enabling more robust downstream perception and planning models.

Abstract

Multimodal synthetic data generation is crucial in domains such as autonomous driving, robotics, augmented/virtual reality, and retail. We propose a novel approach, GenMM, for jointly editing RGB videos and LiDAR scans by inserting temporally and geometrically consistent 3D objects. Our method uses a reference image and 3D bounding boxes to seamlessly insert and blend new objects into target videos. We inpaint the 2D Regions of Interest (consistent with 3D boxes) using a diffusion-based video inpainting model. We then compute semantic boundaries of the object and estimate its surface depth using state-of-the-art semantic segmentation and monocular depth estimation techniques. Subsequently, we employ a geometry-based optimization algorithm to recover the 3D shape of the object's surface, ensuring it fits precisely within the 3D bounding box. Finally, LiDAR rays intersecting with the new object surface are updated to reflect depths consistent with its geometry. Our experiments demonstrate the effectiveness of GenMM in inserting various 3D objects across video and LiDAR modalities.
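
As an illustration of how a 2D Region of Interest can be kept consistent with a 3D box, the sketch below projects the eight corners of a box into the image with a pinhole camera model and takes their bounding rectangle as the inpainting mask. This is not the authors' code. The function names, the axis-aligned box in camera coordinates (no yaw), and the intrinsics `fx`, `fy`, `u0`, `v0` are all assumptions made for the example.

```python
import math
from itertools import product


def box_corners(center, size):
    """Return the 8 corners of an axis-aligned box (assumed camera frame, z forward)."""
    cx, cy, cz = center
    hx, hy, hz = (s / 2.0 for s in size)
    return [(cx + sx * hx, cy + sy * hy, cz + sz * hz)
            for sx, sy, sz in product((-1, 1), repeat=3)]


def project_box_to_roi(center, size, fx, fy, u0, v0, width, height):
    """Project the box corners with a pinhole model and return the clipped
    2D RoI as (u_min, v_min, u_max, v_max), or None if no valid RoI exists."""
    corners = box_corners(center, size)
    # Hypothetical simplification: skip boxes that are not fully in front of the camera.
    if any(z <= 0 for _, _, z in corners):
        return None
    us = [fx * x / z + u0 for x, _, z in corners]
    vs = [fy * y / z + v0 for _, y, z in corners]
    u_min = max(0, math.floor(min(us)))
    v_min = max(0, math.floor(min(vs)))
    u_max = min(width, math.ceil(max(us)))
    v_max = min(height, math.ceil(max(vs)))
    if u_min >= u_max or v_min >= v_max:
        return None
    return (u_min, v_min, u_max, v_max)


def roi_mask(roi, width, height):
    """Binary mask (list of rows) that is 1 inside the RoI and 0 elsewhere."""
    u_min, v_min, u_max, v_max = roi
    return [[1 if (u_min <= u < u_max and v_min <= v < v_max) else 0
             for u in range(width)]
            for v in range(height)]
```

A real pipeline would also apply the box orientation and the LiDAR-to-camera extrinsics before projecting; both are omitted here to keep the sketch short.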

Paper Structure (16 sections, 2 equations, 15 figures, 2 tables)

Figures (15)

  • Figure 1: Overview of the proposed GenMM method. Given a target video and its corresponding LiDAR frames, we project 3D boxes onto the 2D image to create masked RoIs, which are combined with the original image to produce masked input images. These cropped masked inputs, along with their respective masks, are then processed through our video inpainting method (Figure 2), which generates frames with the inpainted objects. These inpainted crops are then blended back into the target video. The inpainted crops and the LiDAR point cloud are input to a geometry-based LiDAR inpainting algorithm (Figure 3) that generates the corresponding LiDAR points for the inserted objects.
  • Figure 2: Video inpainting using a reference image. We inpaint objects using a reference image, object masks, and a masked input image to ensure realistic object insertion. The Inpainting-Unet network utilizes concatenated features from the object mask, masked latents, and noisy latents. Following the approach of Hu et al. (2023), we employ spatial-attention layers (using ReferenceNet features) to ensure appearance consistency between the reference image and the inpainted objects, along with temporal-attention layers to maintain temporal consistency.
  • Figure 3: Overview of the geometry-based LiDAR inpainting approach. Pixels on the inserted 2D object are lifted to 3D, with the depth scaled and shifted to fit the target 3D bounding box. We voxelize the 3D bounding boxes to represent the 3D object surface. Each voxel within the bounding box is classified as either occupied or empty. LiDAR rays that intersect an occupied voxel are updated with the correct range corresponding to the inserted object.
  • Figure 4: Examples of animating reference crops in videos. Given a reference crop from the first frame (right), we inpaint the object in subsequent frames. We mask out the RoI in the target image and generate the object inside the RoI conditioned on the reference crop. The model learns to generate temporally consistent videos without control conditions such as object pose or edge maps.
  • Figure 5: Examples of swapping objects in videos. Each row shows a separate example. The red box in the leftmost image marks the object to be replaced, and the green boxes in the images on the right highlight the object inserted using the reference. In the first row, although the reference image comes from a cloudy scene without prominent shadows, our method re-lights the object for the new environment. We are also able to insert pedestrians and learn their walking patterns without specifying spatial controls, such as OpenPose or depth.
  • ...and 10 more figures
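
The LiDAR step described in the Figure 3 caption (fit depth to the box, mark occupied voxels, update intersecting rays) can be sketched roughly as follows. This is a simplified illustration, not the paper's optimization. The linear min/max depth mapping, the fixed-step ray marching, the sensor placed at the origin, and every function name are assumptions chosen for brevity.

```python
import math


def fit_depth_to_box(rel_depths, near, far):
    """Affinely rescale relative depths so the closest maps to `near` and the
    farthest to `far` (assumed stand-in for the paper's scale/shift fitting)."""
    lo, hi = min(rel_depths), max(rel_depths)
    if hi == lo:
        return [near for _ in rel_depths]
    scale = (far - near) / (hi - lo)
    return [near + scale * (d - lo) for d in rel_depths]


def inside_box(point, box_min, box_max):
    """True if the point lies in the half-open box [box_min, box_max)."""
    return all(lo <= c < hi for c, lo, hi in zip(point, box_min, box_max))


def voxel_index(point, box_min, voxel_size):
    """Integer voxel coordinates of a point relative to the box corner."""
    return tuple(math.floor((c - lo) / voxel_size) for c, lo in zip(point, box_min))


def occupied_voxels(surface_points, box_min, box_max, voxel_size):
    """Set of voxel indices that contain at least one lifted surface point."""
    return {voxel_index(p, box_min, voxel_size)
            for p in surface_points if inside_box(p, box_min, box_max)}


def update_ray_range(direction, old_range, occupied, box_min, box_max, voxel_size):
    """March along a ray from the sensor origin and return the range of the
    first occupied voxel if it is closer than `old_range`; otherwise keep
    `old_range`. Accuracy is limited to the marching step."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    step = voxel_size / 4.0
    k = 1
    while k * step < old_range:
        t = k * step
        p = [t * u for u in unit]
        if inside_box(p, box_min, box_max) and voxel_index(p, box_min, voxel_size) in occupied:
            return t
        k += 1
    return old_range
```

Rays whose original return is closer than the inserted surface keep their range, so existing occluders in front of the new object are preserved. An exact ray-voxel traversal would replace the fixed-step march in a more careful implementation.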