Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints

Jinyang Yuan, Tonglin Chen, Zhimeng Shen, Bin Li, Xiangyang Xue

TL;DR

The paper tackles unsupervised learning of object-centric, compositional scene representations from multiple unspecified viewpoints. It introduces OCLOC, a deep generative model that decouples viewpoint-dependent factors from viewpoint-independent object/background factors and infers them via iterative amortized inference across viewpoints. The approach explicitly models shadows and depth ordering and uses a variational objective with closed-form KL terms, enabling learning without viewpoint annotations. Empirical results on synthetic multi-view datasets show that OCLOC achieves performance competitive with or superior to supervised baselines, generalizes to scenes with more objects, and supports controllable viewpoint manipulation, underscoring its potential for robust, unsupervised scene understanding.

Abstract

Visual scenes are extremely diverse, not only because there are infinitely many possible combinations of objects and backgrounds but also because observations of the same scene may vary greatly as the viewpoint changes. When observing a multi-object visual scene from multiple viewpoints, humans can perceive the scene compositionally from each viewpoint while achieving so-called "object constancy" across different viewpoints, even though the exact viewpoints are not given. This ability is essential for humans to identify the same object while moving and to learn from vision efficiently. It is intriguing to design models that have a similar ability. In this paper, we consider a novel problem of learning compositional scene representations from multiple unspecified (i.e., unknown and unrelated) viewpoints without using any supervision, and propose a deep generative model that separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem. During inference, latent representations are randomly initialized and iteratively updated by integrating information from different viewpoints with neural networks. Experiments on several specifically designed synthetic datasets show that the proposed method can effectively learn from multiple unspecified viewpoints.

Paper Structure (37 sections, 18 equations, 10 figures, 5 tables, 1 algorithm)

This paper contains 37 sections, 18 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Humans can perceive visual scenes compositionally while maintaining object constancy across different viewpoints (the indexes of objects are arbitrarily chosen).
  • Figure 2: The overall framework of the proposed OCLOC. The main learning objective is to reconstruct images of the same visual scene observed from different viewpoints.
  • Figure 3: The probabilistic graphical model of visual scene modeling. $K$ is the maximum number of objects that may appear in the visual scene. $N$ is the number of pixels in each visual scene image. $M$ is the number of viewpoints from which the visual scene is observed. $\boldsymbol{z}_{1:K}^{\text{obj}}$ and $\boldsymbol{z}^{\text{bck}}$ are continuous latent variables that characterize the viewpoint-independent attributes of objects and the background, respectively. $z_k^{\text{prs}}$ with $1 \!\leq\! k \!\leq\! K$ is a binary latent variable that indicates whether the $k$th object is included in the visual scene. This type of latent variable makes it possible to model the varying number of objects in different visual scenes. $\rho_k$ is a continuous latent variable that defines the distribution from which $z_k^{\text{prs}}$ is generated. $\boldsymbol{z}_{m}^{\text{view}}$ with $1 \!\leq\! m \!\leq\! M$ is a continuous latent variable that determines the $m$th viewpoint from which the visual scene is observed. $\boldsymbol{x}_{1:M,1:N}$ represents the observed visual scene images. Neural networks are applied to compute parameters of the likelihood function $p(\boldsymbol{x}_{m,n}|\boldsymbol{z}_{m}^{\text{view}}, \boldsymbol{z}_{1:K}^{\text{obj}}, \boldsymbol{z}^{\text{bck}}, \boldsymbol{z}_{1:K}^{\text{prs}})$.
  • Figure 4: The architecture of decoder networks. The batch dimension is omitted for simplicity.
  • Figure 5: The architecture of encoder networks. The batch dimension is omitted for simplicity. In the dashed box that is repeated $T$ times, variables $\boldsymbol{r}^{\text{view}}$ and $\boldsymbol{r}^{\text{attr}}$ are initialized randomly and updated iteratively.
  • ...and 5 more figures
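
The variables listed in the Figure 3 caption suggest a factorization of the joint distribution along the following lines. This is only a sketch read off from the caption, not the paper's stated equation: it assumes that the viewpoint, object, background, and $\rho_k$ latent variables are mutually independent a priori, that $z_k^{\text{prs}}$ depends only on $\rho_k$, and that pixels are conditionally independent given the latent variables. The prior families themselves are not given in this summary and are left unspecified.

$$
\begin{aligned}
p(\boldsymbol{x}_{1:M,1:N}, \boldsymbol{z}_{1:M}^{\text{view}}, \boldsymbol{z}_{1:K}^{\text{obj}}, \boldsymbol{z}^{\text{bck}}, \boldsymbol{z}_{1:K}^{\text{prs}}, \rho_{1:K})
= {} & p(\boldsymbol{z}^{\text{bck}}) \prod_{m=1}^{M} p(\boldsymbol{z}_{m}^{\text{view}}) \prod_{k=1}^{K} p(\boldsymbol{z}_{k}^{\text{obj}})\, p(\rho_k)\, p(z_k^{\text{prs}}|\rho_k) \\
& \times \prod_{m=1}^{M} \prod_{n=1}^{N} p(\boldsymbol{x}_{m,n}|\boldsymbol{z}_{m}^{\text{view}}, \boldsymbol{z}_{1:K}^{\text{obj}}, \boldsymbol{z}^{\text{bck}}, \boldsymbol{z}_{1:K}^{\text{prs}})
\end{aligned}
$$

Under this reading, only $\boldsymbol{z}_{m}^{\text{view}}$ carries the index $m$, so everything that changes between the $M$ images of a scene must be explained by the viewpoint latent variable, while the object, background, and presence variables are shared across all viewpoints.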