Object-level Scene Deocclusion
Zhengzhe Liu, Qing Liu, Chirui Chang, Jianming Zhang, Daniil Pakhomov, Haitian Zheng, Zhe Lin, Daniel Cohen-Or, Chi-Wing Fu
TL;DR
PACO is a self-supervised, object-level deocclusion framework built around a two-stage diffusion-based architecture. A Parallel Variational Autoencoder encodes a stack of full-view objects into a single full-view feature map. A Visible-to-Complete Latent Generator, conditioned on partial-view features and object text prompts, then generates that full-view feature map from partial inputs. At inference, a layer-wise scheme deoccludes objects by depth layer to improve efficiency. Trained on a large synthetic OE dataset, PACO achieves state-of-the-art results on COCOA, generalizes well to ADE20K and novel scenes, and enables downstream tasks such as image recomposition and single-view 3D reconstruction. By leveraging pre-trained priors and text-conditioned guidance, PACO produces high-fidelity, object-aware completions that surpass traditional inpainting and previous amodal completion methods, with practical implications for editing and 3D scene understanding.
Abstract
Deoccluding the hidden portions of objects in a scene is a formidable task, particularly when addressing real-world scenes. In this paper, we present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, a foundation model for object-level scene deocclusion. Leveraging the rich prior of pre-trained models, we first design the parallel variational autoencoder, which produces a full-view feature map that simultaneously encodes multiple complete objects, and the visible-to-complete latent generator, which learns to implicitly predict the full-view feature map from a partial-view feature map and text prompts extracted from the incomplete objects in the input image. To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning, avoiding tedious annotation of amodal masks and occluded regions. At inference, we devise a layer-wise deocclusion strategy to improve efficiency while maintaining deocclusion quality. Extensive experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin. Our method can also be extended to cross-domain scenes and novel categories that are not covered by the training set. Further, we demonstrate the applicability of PACO to single-view 3D scene reconstruction and object recomposition.
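The relationship between full-view and partial-view objects that underlies the self-supervised setup can be illustrated with a small sketch. It is not taken from the paper: it assumes, hypothetically, that each training sample is a stack of complete object masks ordered front to back, and that the partial (visible) view of each object is what remains after nearer objects cover it. The function name `visible_masks` and the list-of-lists mask format are illustrative choices only.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 marks a pixel belonging to the object


def visible_masks(full_masks: List[Mask]) -> List[Mask]:
    """Derive partial-view (visible) masks from full-view masks.

    Assumption (not from the paper): full_masks is ordered front to back,
    so index 0 is the object nearest to the camera. A pixel of an object
    is visible only if no nearer object covers it.
    """
    if not full_masks:
        return []
    height, width = len(full_masks[0]), len(full_masks[0][0])
    # Pixels already claimed by objects nearer to the camera.
    covered = [[0] * width for _ in range(height)]
    result: List[Mask] = []
    for mask in full_masks:
        visible = [
            [1 if mask[y][x] and not covered[y][x] else 0 for x in range(width)]
            for y in range(height)
        ]
        result.append(visible)
        # This object now occludes everything behind it.
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    covered[y][x] = 1
    return result


# Two overlapping objects on a 2x3 canvas; `front` occludes part of `back`.
front = [[1, 1, 0],
         [0, 0, 0]]
back = [[0, 1, 1],
        [0, 1, 1]]
partial = visible_masks([front, back])
```

Under this assumption, the pair (visible mask, full mask) comes for free from the composition itself, which is why no manual amodal annotation is needed: the front object stays intact, while the back object loses the one pixel that the front object covers.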
