Rethinking Referring Object Removal

Xiangtian Xue, Jiasong Wu, Youyong Kong, Lotfi Senhadji, Huazhong Shu

TL;DR

The authors construct ComCOCO, a synthetic dataset of 136,495 referring expressions for 34,615 objects in 23,951 image pairs, and propose an end-to-end syntax-aware hybrid mapping network with an encoding-decoding structure.

Abstract

Referring object removal is the task of removing the specific object in an image that a natural language expression refers to and filling the missing region with semantically reasonable content. To address this task, we construct ComCOCO, a synthetic dataset consisting of 136,495 referring expressions for 34,615 objects in 23,951 image pairs. Each pair contains an image with referring expressions and the ground truth after removal. We further propose an end-to-end syntax-aware hybrid mapping network with an encoding-decoding structure. Linguistic features are extracted hierarchically at the syntactic level and fused into the visual features during downsampling with multi-head attention. The feature-aligned pyramid network is used to generate segmentation masks and to replace the pixels inside the mask using region affinity learned from external semantics in high-level feature maps. Extensive experiments demonstrate that our model outperforms, by a significant margin, both diffusion models and two-stage methods that handle segmentation and inpainting separately.
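To make the fusion step in the abstract concrete, the following is a minimal single-head sketch of attention-based fusion, where each visual token acts as a query over word features and the attended language context is added back to the token. It is an illustration only: the paper uses multi-head attention inside its encoder, while the function names, the residual addition, and the omission of learned projection matrices here are simplifying assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_language_into_vision(visual_tokens, word_features):
    """Hypothetical helper: single-head cross-attention without learned weights.

    Visual tokens are the queries; word features serve as keys and values.
    The attended language context is added to each visual token.
    """
    dim = len(word_features[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for token in visual_tokens:
        weights = softmax([dot(token, word) * scale for word in word_features])
        context = [
            sum(w * word[d] for w, word in zip(weights, word_features))
            for d in range(dim)
        ]
        fused.append([t + c for t, c in zip(token, context)])
    return fused


# Toy inputs: two visual tokens and two word features in a 2-D feature space.
visual = [[1.0, 0.0], [0.0, 1.0]]
words = [[2.0, 0.0], [0.0, 2.0]]
fused = fuse_language_into_vision(visual, words)
```

In this toy case each visual token pulls most of its language context from the word feature it is most aligned with, which is the behavior attention-based fusion relies on.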

Paper Structure

This paper contains 28 sections, 14 equations, 12 figures, and 10 tables.

Figures (12)

  • Figure 1: The construction process of the ComCOCO dataset, which consists of Scene Matching, Object Placement, and Style Consistency. The two selected images have similar semantic scenes, while the characteristics named in their descriptions, that is, the morphological characteristics of the objects, are completely different. Steps 2 and 3 keep the splicing as plausible as possible. Each processed image pair contains an image with referring expressions and the ground truth. We conduct a dual-phase manual inspection in the pipeline to guarantee image fidelity and to incorporate spatial-based descriptions.
  • Figure 2: The basic learning framework of the proposed SAHM. The overall framework adopts an encoding-decoding structure based on Swin-Transformer. Hierarchical linguistic features $L$, $L_{aw}$, and $L_{iw}$ are fused in successive stages of visual downsampling via syntax-aware visual attention. $Y_i$ consists of a parallel segmentation feature map $F_i$ and an inpainting feature map $I_i$, which are initialized by two Swin-Transformer blocks as the bottleneck and by four residual blocks, respectively. In the hybrid mapping filling module, $S_{i}$ is mapped from $S_{i+1}$ and $V_{i}$ with the skip-connection, and the mask region in $S_{i}$ is further filled with external semantic mapping.
  • Figure 3: The "identity words" point to the segmented object and are the core words in a sentence. The "attribute words" contain attribute information including location, appearance, etc. The remainder with no representative information is deemed redundant.
  • Figure 4: Removal results produced by different models.
  • Figure 5: Visualization of referring object removal results. Images in the first two rows are taken from ComCOCO, and our result is visually compared with the ground truth. The last two rows show real images for which no ground truth exists. The first image in each group is the original, and the next two are our removal results under the guidance of the expressions.
  • ...and 7 more figures
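
The mask-filling idea mentioned in the Figure 2 caption (filling the mask region from semantics outside it) can be sketched as an affinity-weighted average over unmasked features. This is a rough illustration under stated assumptions, not the paper's module: the paper learns region affinity in high-level feature maps, whereas this sketch substitutes fixed cosine similarity with a hand-picked temperature, and `fill_masked_features` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fill_masked_features(features, mask, temperature=0.1):
    """Hypothetical helper: replace masked features with external ones.

    features: list of feature vectors, one per spatial position.
    mask: list of booleans, True where the position lies inside the mask.
    Each masked feature becomes a softmax-weighted average of the unmasked
    features, weighted by cosine similarity (an assumed stand-in for the
    learned affinity described in the paper).
    """
    outside = [f for f, m in zip(features, mask) if not m]
    if not outside:
        raise ValueError("at least one unmasked position is required")
    dim = len(features[0])
    filled = []
    for feature, masked in zip(features, mask):
        if not masked:
            filled.append(list(feature))
            continue
        weights = softmax([cosine(feature, o) / temperature for o in outside])
        filled.append(
            [sum(w * o[d] for w, o in zip(weights, outside)) for d in range(dim)]
        )
    return filled


# Toy example: the third position is masked and is rebuilt from the first two.
features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
mask = [False, False, True]
filled = fill_masked_features(features, mask)
```

Because the weights form a convex combination, every filled feature lies within the span of the features outside the mask, so nothing from the removed object is copied back into the filled region.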