Object-level Scene Deocclusion

Zhengzhe Liu; Qing Liu; Chirui Chang; Jianming Zhang; Daniil Pakhomov; Haitian Zheng; Zhe Lin; Daniel Cohen-Or; Chi-Wing Fu

TL;DR

PACO is a self-supervised, object-level deocclusion framework built around a two-stage diffusion-based architecture. A Parallel Variational Autoencoder encodes a stack of full-view objects into a single full-view feature map, and a Visible-to-Complete Latent Generator, conditioned on partial-view features and object text prompts, generates that full-view feature map from partial inputs. At inference, objects are grouped into layers by depth and each layer is deoccluded in one diffusion pass, which keeps the procedure efficient. Trained on a large synthetic OE dataset, PACO achieves state-of-the-art results on COCOA, generalizes to ADE20K and novel scenes, and enables downstream tasks such as image recomposition and single-view 3D reconstruction. By combining pre-trained priors with text-conditioned guidance, PACO produces object-aware completions that surpass traditional inpainting and previous amodal completion methods, with practical implications for image editing and 3D scene understanding.

Abstract

Deoccluding the hidden portions of objects in a scene is a formidable task, particularly when addressing real-world scenes. In this paper, we present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, a foundation model for object-level scene deocclusion. Leveraging the rich prior of pre-trained models, we first design the parallel variational autoencoder, which produces a full-view feature map that simultaneously encodes multiple complete objects, and the visible-to-complete latent generator, which learns to implicitly predict the full-view feature map from the partial-view feature map and text prompts extracted from the incomplete objects in the input image. To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning, avoiding tedious annotations of the amodal masks and occluded regions. At inference, we devise a layer-wise deocclusion strategy to improve efficiency while maintaining the deocclusion quality. Extensive experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin. Our method can also be extended to cross-domain scenes and novel categories that are not covered by the training set. Further, we demonstrate the applicability of PACO to single-view 3D scene reconstruction and object recomposition.

Paper Structure (29 sections, 6 equations, 22 figures, 3 tables)

Related Work
Image Inpainting
Amodal Instance Segmentation
Amodal Appearance Completion
Occlusion Order Prediction
Diffusion Models
Overview
Methodology
Parallel Variational Autoencoder
Discussion.
Visible-to-Complete Latent Generator
Inference
Data Preparation for Self-Supervised Training
Dataset Creation
Training Strategy (Stage 1)
...and 14 more sections

Figures (22)

Figure 1: Overview of our PACO framework. (a) In the first training stage, we train the Parallel Variational Autoencoder $\{E_1,D_1\}$ to learn to encode a stack of complete (full-view) objects $\{O_i\}$ into full-view feature map $\hat{f}$ and the decoder $D_1$ to reconstruct the specific object $O_i$ for the partial query mask $m_i$. (b) In the second training stage, we train the Visible-to-Complete Latent Generator to generate full-view feature map $f$ conditioned on the partial-view feature map $f_p$ from only segmented visible objects. (c) At inference, we employ the visible-to-complete latent generator to generate full-view feature map $f$ conditioned on partial-view feature map $f_p$ encoded from partial objects, then use $D_1$ to recover the amodal appearance $\tilde{O}_i$ with the partial mask $m_i$ as the query.
Figure 2: Detailed architecture of our visible-to-complete latent generator.
Figure 3: Illustration of the layer-wise deocclusion strategy. Given an image, we first determine the occlusion relation among the objects using a depth estimation technique. Then, for each depth layer, we deocclude all objects in the same depth layer simultaneously in a unified diffusion pass.
Figure 4: Qualitative comparison with SSSD [zhan2020self]. The arrows indicate the target object to be deoccluded and the completed object parts.
Figure 5: Qualitative comparison with existing work VINV [zheng2021visiting]. Results of (b) are directly taken from their paper. The recovered regions from VINV, i.e., the bottom of the cup (left) and the tail light of the car (right), are blurry, while our approach gives higher-quality deocclusion results.
...and 17 more figures
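
The layer-wise strategy in the Figure 3 caption (estimate depth, group objects into depth layers, deocclude each layer in one pass) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: the helper names `object_depth` and `group_into_layers`, the use of median depth per visible mask, the `tolerance` threshold, and the convention that smaller depth values are nearer the camera.

```python
from statistics import median


def object_depth(depth_map, mask):
    """Median depth over the visible pixels of one object (assumed summary statistic)."""
    values = [
        d
        for depth_row, mask_row in zip(depth_map, mask)
        for d, m in zip(depth_row, mask_row)
        if m
    ]
    if not values:
        raise ValueError("mask selects no pixels")
    return median(values)


def group_into_layers(depths, tolerance):
    """Group object indices into depth layers, ordered near to far.

    Objects are sorted by depth; a new layer starts whenever an object is
    more than `tolerance` behind the first object of the current layer.
    """
    order = sorted(range(len(depths)), key=lambda i: depths[i])
    layers = []
    for i in order:
        if layers and depths[i] - depths[layers[-1][0]] <= tolerance:
            layers[-1].append(i)
        else:
            layers.append([i])
    return layers


# Toy scene: a 2x3 depth map with one near object and one far object.
depth_map = [[1.0, 1.0, 4.0],
             [1.0, 4.0, 4.0]]
near_mask = [[1, 1, 0],
             [1, 0, 0]]
far_mask = [[0, 0, 1],
            [0, 1, 1]]

scene_depths = [object_depth(depth_map, near_mask),
                object_depth(depth_map, far_mask)]
scene_layers = group_into_layers(scene_depths, tolerance=0.5)

# Each layer would then be handed to a single (hypothetical) diffusion pass
# that completes all objects in that layer together.
for layer_index, object_ids in enumerate(scene_layers):
    print(f"layer {layer_index}: objects {object_ids}")
```

Under these assumptions the number of diffusion passes equals the number of depth layers rather than the number of objects, which is the efficiency argument the caption makes.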
