WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion
Khiem Vuong, N. Dinesh Reddy, Robert Tamburo, Srinivasa G. Narasimhan
TL;DR
Occlusion poses a major challenge for 2D/3D object understanding, and amodal supervision is hard to obtain. WALT3D automatically generates realistic training data by mining unoccluded objects from time-lapse video, estimating their 3D pose and shape, and re-inserting them into the scene in a geometry-consistent, clip-art fashion to produce rich 2D amodal annotations and 3D pseudo-groundtruth. Empirical results on vehicles and humans show clear gains in both 2D and 3D reconstruction under heavy occlusion, along with strong data efficiency and cross-dataset generalization. The approach is scalable and privacy-conscious (faces and license plates are blurred), and it can augment existing datasets to improve occlusion-robust perception in smart-city and robotics applications.
Abstract
Current methods for 2D and 3D object understanding struggle with severe occlusions in busy urban environments, partly due to the lack of large-scale labeled ground-truth annotations for learning occlusion. In this work, we introduce a novel framework for automatically generating a large, realistic dataset of dynamic objects under occlusions using freely available time-lapse imagery. By leveraging off-the-shelf 2D (bounding box, segmentation, keypoint) and 3D (pose, shape) predictions as pseudo-groundtruth, unoccluded 3D objects are identified automatically and composited into the background in a clip-art style, ensuring realistic appearances and physically accurate occlusion configurations. The resulting clip-art images with pseudo-groundtruth enable efficient training of object reconstruction methods that are robust to occlusions. Our method demonstrates significant improvements in both 2D and 3D reconstruction, particularly in scenarios with heavily occluded objects such as vehicles and people in urban scenes.
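The clip-art compositing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `MinedObject`, `rect`, and `composite`, the use of pixel sets as masks, and the single scalar depth per object are all assumptions made for illustration. The sketch shows only why compositing objects that were mined while unoccluded yields amodal labels for free: each object's full mask is already known, so its visible mask and occlusion ratio follow from depth ordering.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

Pixel = Tuple[int, int]  # (row, col)


@dataclass(frozen=True)
class MinedObject:
    """Hypothetical record for an object mined while fully visible."""
    name: str
    amodal_mask: FrozenSet[Pixel]  # full-object pixels at the paste location
    depth: float                   # assumed distance from camera; larger = farther


def rect(top: int, left: int, height: int, width: int) -> FrozenSet[Pixel]:
    """Toy stand-in for a segmentation mask: a filled rectangle of pixels."""
    return frozenset((r, c)
                     for r in range(top, top + height)
                     for c in range(left, left + width))


def composite(objects: List[MinedObject]) -> Dict[str, dict]:
    """Derive visible masks and occlusion ratios from depth ordering."""
    covered = set()  # pixels already claimed by nearer objects
    annotations = {}
    # Visit objects nearest-first: whatever a nearer object covers is hidden
    # from every object behind it.
    for obj in sorted(objects, key=lambda o: o.depth):
        if not obj.amodal_mask:
            raise ValueError(f"{obj.name} has an empty amodal mask")
        visible = obj.amodal_mask - covered
        annotations[obj.name] = {
            "amodal": obj.amodal_mask,
            "visible": frozenset(visible),
            "occlusion": 1.0 - len(visible) / len(obj.amodal_mask),
        }
        covered |= obj.amodal_mask
    return annotations


# A person standing in front of a car, overlapping a 2x2 pixel corner.
annotations = composite([
    MinedObject("car", rect(0, 0, 4, 4), depth=10.0),
    MinedObject("person", rect(2, 2, 4, 4), depth=5.0),
])
```

In this toy scene the car loses 4 of its 16 pixels to the person, so its occlusion ratio is 0.25, while the person stays fully visible. A real pipeline would use 3D pose and shape to decide ordering and placement rather than a single depth value per object.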
