Unsupervised Object Detection with Theoretical Guarantees

Marian Longa, João F. Henriques

TL;DR

This work develops an unsupervised object detection architecture and proves that the learned variables correspond to the true object positions up to small shifts related to the encoder and decoder receptive field sizes, the object sizes, and the widths of the Gaussians used in the rendering process.

Abstract

Unsupervised object detection using deep neural networks is typically a difficult problem with few to no guarantees about the learned representation. In this work we present the first unsupervised object detection method that is theoretically guaranteed to recover the true object positions up to quantifiable small shifts. We develop an unsupervised object detection architecture and prove that the learned variables correspond to the true object positions up to small shifts related to the encoder and decoder receptive field sizes, the object sizes, and the widths of the Gaussians used in the rendering process. We perform a detailed analysis of how the error depends on each of these variables and run synthetic experiments that validate our theoretical predictions up to a precision of individual pixels. We also perform experiments on CLEVR-based data and show that, unlike current SOTA object detection methods (SAM, CutLER), our method's prediction errors always lie within our theoretical bounds. We hope that this work helps open up an avenue of research into object detection methods with theoretical guarantees.

Paper Structure (35 sections, 7 theorems, 22 equations, 11 figures)

Key Result

Theorem 4.1

Consider a set of images $x \sim X$ with objects of size $s_o$, CNN encoder $\psi$ with receptive field size $s_{\psi}$, CNN decoder $\phi$ with receptive field size $s_{\phi}$, soft argmax function $\mathrm{softargmax}$, rendering function $\mathrm{render}$ with Gaussian standard deviation $\sigma_

Figures (11)

  • Figure 1: Network architecture. Encoder: (1) an image $x$ is passed through a CNN $\psi$ to obtain $n$ embedding maps $e_1,...,e_n$, (2) a maximum of each map is found using softargmax to obtain latent variables $[z_{1,x},z_{1,y},...,z_{n,x},z_{n,y}]$. Decoder: (1) Gaussians $\hat{e}_1,...,\hat{e}_n$ are rendered at the positions given by the latent variables, (2) the Gaussian maps are concatenated with positional encodings and passed through a CNN $\phi$ to obtain the predicted image $\hat{x}$. Finally, $x$ and $\hat{x}$ are used to compute reconstruction loss $\mathcal{L}(\hat{x},x)$.
  • Figure 2: Position errors. (a) Maximum position error due to encoder, given by $s_\psi/2+s_o/2-1$. The maximum error occurs when the encoder and the object are as far away from each other as possible while still overlapping by one pixel. (b) Maximum position error due to decoder, given by $s_\phi/2-s_o/2+\Delta_G$. The maximum error occurs when some part of the Gaussian at position $z+\Delta_G$ is within the decoder receptive field (RF) but is as far away from the rendered object as possible.
  • Figure 3: Theoretical bounds for the maximum position error as a function of the encoder receptive field size $s_\psi$, decoder receptive field size $s_\phi$, object size $s_o$, and Gaussian standard deviation $\sigma_G$, with the remaining factors held fixed. Each bound consists of a region due to the encoder error (solid line) and a region due to the decoder error (probabilistic bound). Standard deviations are represented by shades of blue.
  • Figure 4: Synthetic experiment results showing position error as a function of the encoder receptive field size $s_\psi$, decoder receptive field size $s_\phi$, object size $s_o$, and Gaussian standard deviation $\sigma_G$, with the remaining factors fixed to $s_\psi=9, s_\phi=25, s_o=9, \sigma_G=0.8$ (in a,b,c) or to $s_\psi=9, s_\phi=11, s_o=7$ (in d). Theoretical bounds are shown as a blue line (with 4 shaded regions denoting 1 to 4 standard deviations of the probabilistic bound) and experimental results as red dots.
  • Figure 5: CLEVR experiment results showing position error as a function of the encoder receptive field size $s_\psi$, decoder receptive field size $s_\phi$, object size $s_o$, and Gaussian standard deviation $\sigma_G$, with the remaining factors fixed to $s_\psi=9, s_\phi=25, s_o \in [6,10], \sigma_G=0.8$ for (a)-(c) and to $s_\psi=5, s_\phi=13, s_o \in [6,10]$ for (d). Theoretical bounds are shown in blue, experimental results in red, the SAM baseline in green, and the CutLER baseline in orange.
  • ...and 6 more figures
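
The two position-related steps in the Figure 1 caption (soft argmax over an embedding map, then rendering a Gaussian at the resulting position) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `temperature` parameter, the unnormalised isotropic Gaussian, and the pixel-coordinate convention are all assumptions made for illustration, and the CNNs $\psi$ and $\phi$ are omitted.

```python
import numpy as np


def soft_argmax_2d(embedding, temperature=1.0):
    """Softmax-weighted mean pixel coordinate (x, y) of a 2D embedding map.

    `temperature` is a hypothetical knob, not taken from the paper.
    """
    h, w = embedding.shape
    logits = embedding / temperature
    logits = logits - logits.max()      # subtract the max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()            # softmax over all pixels
    ys, xs = np.mgrid[0:h, 0:w]         # row (y) and column (x) index grids
    return float((weights * xs).sum()), float((weights * ys).sum())


def render_gaussian(z_x, z_y, shape, sigma_g):
    """Render an unnormalised isotropic Gaussian map centred at (z_x, z_y)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    sq_dist = (xs - z_x) ** 2 + (ys - z_y) ** 2
    return np.exp(-sq_dist / (2.0 * sigma_g ** 2))


# Toy embedding map with a single sharp peak at row 10, column 20.
e = np.zeros((32, 32))
e[10, 20] = 50.0

z_x, z_y = soft_argmax_2d(e)                              # latent position
e_hat = render_gaussian(z_x, z_y, e.shape, sigma_g=0.8)   # decoder-side map
```

For a map with one dominant peak, the soft argmax essentially coincides with the hard argmax, so the rendered Gaussian is centred on the same pixel. With flatter maps, the soft argmax instead returns a weighted average of pixel coordinates.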

Theorems & Definitions (8)

  • Theorem 4.1: Error Bound
  • Corollary 4.2: Error Bound vs. Encoder RF Size
  • Corollary 4.3: Error Bound vs. Decoder RF Size
  • Corollary 4.4: Error Bound vs. Object Size
  • Corollary 4.5: Error Bound vs. Gaussian Size
  • proof
  • Corollary B.1: Error vs. Encoder RF Size for Multiple Object Sizes
  • Corollary B.2: Error vs. Decoder RF Size for Multiple Object Sizes
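
As a worked example, the encoder and decoder error terms quoted in the Figure 2 caption, $s_\psi/2+s_o/2-1$ and $s_\phi/2-s_o/2+\Delta_G$, can be evaluated directly. The sketch below uses the sizes from the Figure 4 caption ($s_\psi=9$, $s_\phi=25$, $s_o=9$). The value `delta_g=2.0` is a placeholder, since this summary does not state how $\Delta_G$ relates to $\sigma_G$. The function names are hypothetical, and the sketch makes no claim about how Theorem 4.1 combines the two terms.

```python
def encoder_error_term(s_psi, s_o):
    """Maximum position error due to the encoder (Figure 2a): s_psi/2 + s_o/2 - 1."""
    return s_psi / 2 + s_o / 2 - 1


def decoder_error_term(s_phi, s_o, delta_g):
    """Maximum position error due to the decoder (Figure 2b): s_phi/2 - s_o/2 + Delta_G.

    `delta_g` is passed in by the caller; its dependence on sigma_G is not
    given in this summary, so no formula for it is assumed here.
    """
    return s_phi / 2 - s_o / 2 + delta_g


# Sizes from the Figure 4 caption; delta_g is an illustrative placeholder.
enc = encoder_error_term(s_psi=9, s_o=9)                 # 4.5 + 4.5 - 1 = 8.0
dec = decoder_error_term(s_phi=25, s_o=9, delta_g=2.0)   # 12.5 - 4.5 + 2.0 = 10.0
print(f"encoder term: {enc} px, decoder term: {dec} px")
```

Both terms are in pixels. The encoder term grows with object size, while the decoder term shrinks with it.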