WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion

Khiem Vuong, N. Dinesh Reddy, Robert Tamburo, Srinivasa G. Narasimhan

TL;DR

Occlusion poses a major challenge for 2D/3D object understanding, and amodal supervision is hard to obtain. WALT3D automatically generates realistic training data by mining unoccluded objects from time-lapse video, estimating their 3D pose and shape, and re-inserting them into the scene in a geometry-consistent clip-art fashion to produce rich 2D amodal annotations and 3D pseudo-groundtruth. Empirical results on vehicles and humans show clear gains in both 2D and 3D reconstruction under heavy occlusion, with strong data efficiency and cross-dataset generalization. The approach is scalable, privacy-conscious (faces and license plates are blurred), and can augment existing datasets to improve occlusion-robust perception in smart-city and robotics applications.

Abstract

Current methods for 2D and 3D object understanding struggle with severe occlusions in busy urban environments, partly due to the lack of large-scale labeled ground-truth annotations for learning occlusion. In this work, we introduce a novel framework for automatically generating a large, realistic dataset of dynamic objects under occlusions using freely available time-lapse imagery. By leveraging off-the-shelf 2D (bounding box, segmentation, keypoint) and 3D (pose, shape) predictions as pseudo-groundtruth, unoccluded 3D objects are identified automatically and composited into the background in a clip-art style, ensuring realistic appearances and physically accurate occlusion configurations. The resulting clip-art image with pseudo-groundtruth enables efficient training of object reconstruction methods that are robust to occlusions. Our method demonstrates significant improvements in both 2D and 3D reconstruction, particularly in scenarios with heavily occluded objects like vehicles and people in urban scenes.

Paper Structure

This paper contains 16 sections, 2 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Models trained on our automatically generated data from time-lapse imagery can reliably estimate amodal 2D bounding boxes and segmentation, as well as 3D shape and pose, despite the complex occlusions present in the input image.
  • Figure 2: Given a time-lapse video, we automatically generate 2D/3D training data under severe occlusions. We start by detecting each object in the video and identifying the unoccluded (fully visible) ones. Each unoccluded object is then reconstructed in 3D using the ground plane and camera parameters. Using the 3D pose, unoccluded objects are composited back into their original locations (i.e., clip-art style) in a geometrically consistent manner. The composited image and its pseudo-groundtruth from off-the-shelf methods (e.g., segmentation, keypoints, shapes) are used to train a model that produces accurate 2D/3D object reconstruction under severe occlusions.
  • Figure 3: Automatically generated 2D and 3D clip-art used to supervise our network. Unoccluded objects are first mined from the time-lapse imagery of the WALT dataset [Reddy_2022_CVPR]. Non-intersecting unoccluded objects are then composited back into the background image at their original positions to preserve correct appearances. The resulting clip-art images are shown along with their corresponding amodal pseudo-groundtruth, such as segmentation, keypoints, depth/normal maps, and 3D shapes. Our method generates realistic appearances from any stationary camera, incorporating diverse viewing geometries, weather conditions, lighting, and occlusion configurations.
  • Figure 4: Comparison between images composited using the 2D-based method WALT2D [Reddy_2022_CVPR] (left) and our 3D-based method WALT3D (right). Our 3D-based compositing produces realistic and geometrically accurate occlusion configurations, whereas the 2D-based method can produce infeasible ones (e.g., cars and people overlapping in physically impossible ways).
  • Figure 5: Sample images from our new vehicle 2D keypoints dataset. The dataset covers a wide range of appearance variations, including day and night conditions and various traffic scenarios.
  • ...and 8 more figures
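
The compositing step described in the Figure 2 and Figure 3 captions can be illustrated with a small sketch. This is not the authors' implementation: it assumes each mined object already comes with a full-extent (amodal) mask and a single scalar depth derived from its 3D pose on the ground plane, and that the objects have already been checked to be non-intersecting in 3D. The names `ClipArtObject`, `composite`, and `rect` are hypothetical. Objects are painted from far to near, so the nearest object owns any shared pixel. The amodal mask stays unchanged as the supervision target, while the visible mask and occluded fraction fall out of the ordering.

```python
from dataclasses import dataclass


@dataclass
class ClipArtObject:
    name: str
    depth: float            # assumed: distance from the camera, from the 3D pose
    amodal_mask: frozenset  # (row, col) pixels of the full, unoccluded object


def composite(objects):
    """Paint objects far-to-near; return {name: (visible_mask, occluded_fraction)}."""
    owner = {}  # pixel -> name of the object visible at that pixel
    for obj in sorted(objects, key=lambda o: o.depth, reverse=True):
        for px in obj.amodal_mask:
            owner[px] = obj.name  # nearer objects overwrite farther ones
    labels = {}
    for obj in objects:
        visible = frozenset(px for px in obj.amodal_mask if owner[px] == obj.name)
        total = len(obj.amodal_mask)
        occluded = 1.0 - len(visible) / total if total else 0.0
        labels[obj.name] = (visible, occluded)
    return labels


def rect(r0, r1, c0, c1):
    """Hypothetical helper: a rectangular mask covering rows r0..r1-1, cols c0..c1-1."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


# Toy scene: a person standing in front of a car, overlapping it by 4 pixels.
car = ClipArtObject("car", depth=12.0, amodal_mask=rect(0, 4, 0, 4))
person = ClipArtObject("person", depth=5.0, amodal_mask=rect(2, 6, 2, 4))
labels = composite([car, person])
```

In this toy scene the car loses 4 of its 16 pixels to the person, so its occluded fraction is 0.25, while the person remains fully visible. A real pipeline would order objects per pixel using rendered depth maps rather than one depth per object; the scalar depth here is a simplification for illustration.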