Learning Global Object-Centric Representations via Disentangled Slot Attention

Tonglin Chen, Yinxuan Huang, Zhimeng Shen, Jinghao Huang, Bin Li, Xiangyang Xue

TL;DR

Experimental results substantiate the efficacy of the proposed object-centric learning method, demonstrating remarkable proficiency in global object-centric representation learning, object identification, scene generation with specific objects and scene decomposition.

Abstract

Humans can discern scene-independent features of objects across various environments, allowing them to swiftly identify objects amidst changing factors such as lighting, perspective, size, and position, and to imagine the complete images of the same object in diverse settings. Existing object-centric learning methods extract only scene-dependent object-centric representations and therefore cannot identify the same object across scenes as humans do. Moreover, some existing methods give up the ability to generate individual objects in order to handle complex scenes. This paper introduces a novel object-centric learning method that gives AI systems the human-like capability to identify objects across scenes and to generate diverse scenes containing specific objects by learning a set of global object-centric representations. To learn global object-centric representations that encapsulate globally invariant attributes of objects (i.e., the complete appearance and shape), this paper designs a Disentangled Slot Attention module that converts the scene features into scene-dependent attributes (such as scale, position and orientation) and scene-independent representations (i.e., appearance and shape). Experimental results substantiate the efficacy of the proposed method, demonstrating remarkable proficiency in global object-centric representation learning, object identification, scene generation with specific objects and scene decomposition.

Paper Structure

This paper contains 15 sections, 8 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of GOLD. GOLD consists of the Image Encoder-Decoder module and the Global Object-Centric Learning module. The Image Encoder-Decoder module converts the input image into patch features and reconstructs the scene image from the reconstructed patch features via a VQ-VAE decoder. The Global Object-Centric Learning module learns the global object-centric representations $\boldsymbol{e}_{1:C}^{\text{glo}}$ and extracts the extrinsic representation $\boldsymbol{s}_{1:K}^{\text{ext}}$, the identity representation $\boldsymbol{y}_{1:K}$ and the background representation $\boldsymbol{s}^{\text{bck}}$. Background features are encoded and decoded separately by the Bck Encoder and Bck Decoder. The Disentangled Slot Attention module extracts the scene-dependent attribute representation $\boldsymbol{s}_{1:K}^{\text{ext}}$ and the scene-independent representation $\boldsymbol{y}_{1:K}$. $\boldsymbol{y}_{1:K}$ is used to select the corresponding intrinsic representation from $\boldsymbol{e}_{1:C}^{\text{glo}}$ to reconstruct the patch features via the Object Decoder. The Image Encoder-Decoder module is optimized by minimizing the Mean Square Error loss between the input and reconstructed images. The Global Object-Centric Learning module is trained by minimizing the Mean Square Error between the extracted and reconstructed patch features.
  • Figure 2: Visualization of the prototype images generated by GOLD and GOCL. 'Pro1' to 'Pro10' (or 'Pro11') each denote one generated prototype image.
  • Figure 3: Visualization of Scene Generation with Specific Objects on the GSO dataset. 'Ext1' and 'Ext2' indicate two different representations of the extrinsic attributes. 'Sample1' $\sim$ 'Sample4' denote four samples, each of which contains six scene images. In each sample, each row of scene images contains the same objects, and each column contains increasing numbers of objects. The global object-centric representations of the two columns of scene images are the same, but the extrinsic attribute representations are different.
  • Figure 4: Visualization of Attributes Disentanglement of GOLD on four datasets. 'BEE' indicates the reconstructed scene before exchanging extrinsic attributes. 'AEE' denotes the generated scene after exchanging extrinsic attributes. Two samples ('sample1' and 'sample2') are shown for each dataset. The red and green arrows in each sample point to the two objects whose extrinsic attributes are exchanged. In the 'BEE' and 'AEE' rows, arrows of the same color point to objects with the same intrinsic attributes, i.e., the same object before and after exchanging the extrinsic attributes.
  • Figure 5: Visualization of Individual Object Generation of GOLD, STEVE and LSD on the GSO and OCTA datasets. 'Image' indicates the input image. 'Back' denotes the background generated via the background representation (GOLD only). 'Obj1' $\sim$ 'Obj8' denote the individual object images generated via the corresponding extrinsic representation and the selected global object-centric representation.
  • ...and 1 more figures
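
The Figure 1 caption describes two steps that can be illustrated with a toy example: each identity representation $\boldsymbol{y}_k$ selects one of the $C$ global representations $\boldsymbol{e}_{1:C}^{\text{glo}}$, and training minimizes a mean squared error between extracted and reconstructed features. The sketch below is only an illustration under stated assumptions, not the authors' implementation. It assumes that $\boldsymbol{y}_k$ is a vector of $C$ logits and that selection is a hard argmax; the caption does not say whether the selection is hard or soft. The function names `select_intrinsic` and `mse` and all numeric values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def select_intrinsic(identity_logits, global_reps):
    """Return (index, representation) of the most likely global entry.

    Assumption: hard argmax selection; the source does not specify this.
    """
    probs = softmax(identity_logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, global_reps[index]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy data (hypothetical): C = 3 global representations of dimension 2.
global_reps = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# K = 2 slots, each with one logit per global representation.
identity_logits = [[0.1, 2.0, -1.0], [3.0, 0.2, 0.1]]

selected = [select_intrinsic(y, global_reps) for y in identity_logits]
for k, (index, rep) in enumerate(selected):
    print(f"slot {k}: global index {index}, intrinsic representation {rep}")

# Reconstruction-style loss between a feature vector and its reconstruction.
loss = mse([1.0, 2.0], [1.0, 4.0])
print("mse:", loss)
```

Because the same global entry can be selected in different scenes, an object's intrinsic representation stays fixed while only its extrinsic representation changes, which is the property the figure captions illustrate by exchanging extrinsic attributes between objects.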