Unsupervised Object Detection with Theoretical Guarantees

Marian Longa; João F. Henriques

TL;DR

This work develops an unsupervised object detection architecture and proves that the learned variables correspond to the true object positions up to small shifts related to the encoder and decoder receptive field sizes, the object sizes, and the widths of the Gaussians used in the rendering process.

Abstract

Unsupervised object detection using deep neural networks is typically a difficult problem with few to no guarantees about the learned representation. In this work we present the first unsupervised object detection method that is theoretically guaranteed to recover the true object positions up to quantifiable small shifts. We develop an unsupervised object detection architecture and prove that the learned variables correspond to the true object positions up to small shifts related to the encoder and decoder receptive field sizes, the object sizes, and the widths of the Gaussians used in the rendering process. We analyze in detail how the error depends on each of these variables and run synthetic experiments that validate our theoretical predictions up to a precision of individual pixels. We also perform experiments on CLEVR-based data and show that, unlike current SOTA object detection methods (SAM, CutLER), our method's prediction errors always lie within our theoretical bounds. We hope that this work helps open up an avenue of research into object detection methods with theoretical guarantees.

Paper Structure (35 sections, 7 theorems, 22 equations, 11 figures)

Introduction
Related Work
Object Detection.
Identifiability in Representation Learning.
Method
Autoencoder with CNN Encoder and Decoder.
From Encoder Feature Maps to Latent Variables.
From Latent Variables to Decoder Feature Maps.
Theoretical Results
Experimental Results
Synthetic Experiments
Position Error vs. Encoder RF Size.
Position Error vs. Decoder RF Size.
Position Error vs. Object Size.
Position Error vs. Gaussian Size.
...and 20 more sections

Key Result

Theorem 4.1

Consider a set of images $x \sim X$ with objects of size $s_o$, CNN encoder $\psi$ with receptive field size $s_{\psi}$, CNN decoder $\phi$ with receptive field size $s_{\phi}$, soft argmax function $\mathrm{softargmax}$, rendering function $\mathrm{render}$ with Gaussian standard deviation $\sigma_

Figures (11)

Figure 1: Network architecture. Encoder: (1) an image $x$ is passed through a CNN $\psi$ to obtain $n$ embedding maps $e_1,...,e_n$, (2) a maximum of each map is found using softargmax to obtain latent variables $[z_{1,x},z_{1,y},...,z_{n,x},z_{n,y}]$. Decoder: (1) Gaussians $\hat{e}_1,...,\hat{e}_n$ are rendered at the positions given by the latent variables, (2) the Gaussian maps are concatenated with positional encodings and passed through a CNN $\phi$ to obtain the predicted image $\hat{x}$. Finally, $x$ and $\hat{x}$ are used to compute reconstruction loss $\mathcal{L}(\hat{x},x)$.
Figure 2: Position errors. (a) Maximum position error due to the encoder, given by $s_\psi/2+s_o/2-1$. The maximum error occurs when the encoder receptive field (RF) and the object are as far away from each other as possible while still overlapping by one pixel. (b) Maximum position error due to the decoder, given by $s_\phi/2-s_o/2+\Delta_G$. The maximum error occurs when some part of the Gaussian at position $z+\Delta_G$ is within the decoder RF but is as far away from the rendered object as possible.
Figure 3: Theoretical bounds for the maximum position error as a function of the encoder receptive field size $s_\psi$, decoder receptive field size $s_\phi$, object size $s_o$, and Gaussian standard deviation $\sigma_G$, with the remaining factors held fixed. Each bound consists of a region due to the encoder error (solid line) and a region due to the decoder error (probabilistic bound). Standard deviations of the probabilistic bound are shown as shades of blue.
Figure 4: Synthetic experiment results showing position error as a function of the encoder receptive field size $s_\psi$, decoder receptive field size $s_\phi$, object size $s_o$, and Gaussian standard deviation $\sigma_G$, with the remaining factors fixed to $s_\psi=9, s_\phi=25, s_o=9, \sigma_G=0.8$ (in a,b,c) or to $s_\psi=9, s_\phi=11, s_o=7$ (in d). Theoretical bounds are shown as a blue line (with 4 shaded regions denoting 1 to 4 standard deviations of the probabilistic bound) and experimental results as red dots.
Figure 5: CLEVR experiment results showing position error as a function of the encoder receptive field size $s_\psi$, decoder receptive field size $s_\phi$, object size $s_o$, and Gaussian standard deviation $\sigma_G$, with the remaining factors fixed to $s_\psi=9, s_\phi=25, s_o \in [6,10], \sigma_G=0.8$ for (a)-(c) and to $s_\psi=5, s_\phi=13, s_o \in [6,10]$ for (d). Theoretical bounds are shown in blue, experimental results in red, the SAM baseline in green, and the CutLER baseline in orange.
...and 6 more figures
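
The two non-CNN steps in the Figure 1 caption (softargmax on an embedding map, then rendering a Gaussian at the resulting position) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `softargmax_2d` and `render_gaussian`, the `temperature` parameter, the unnormalized isotropic Gaussian, and the 32x32 map size are all assumptions made for illustration, and the CNNs $\psi$ and $\phi$ are omitted entirely.

```python
import numpy as np

def softargmax_2d(e, temperature=1.0):
    """Softmax-weighted mean pixel coordinate of a 2D map (hypothetical helper).

    Returns (z_x, z_y), a differentiable stand-in for the argmax location.
    """
    h, w = e.shape
    logits = e / temperature
    logits = logits - logits.max()      # subtract max for numerical stability
    p = np.exp(logits)
    p /= p.sum()                        # softmax over all pixels
    ys, xs = np.mgrid[0:h, 0:w]         # per-pixel row (y) and column (x) indices
    return float((p * xs).sum()), float((p * ys).sum())

def render_gaussian(z_x, z_y, shape, sigma):
    """Render an unnormalized isotropic Gaussian centred at (z_x, z_y)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - z_x) ** 2 + (ys - z_y) ** 2) / (2.0 * sigma ** 2))

# Toy embedding map with a single strong peak at row 10, column 20.
e = np.zeros((32, 32))
e[10, 20] = 50.0

z_x, z_y = softargmax_2d(e)                              # latent variables for this map
e_hat = render_gaussian(z_x, z_y, e.shape, sigma=0.8)    # Gaussian map fed to the decoder
peak = np.unravel_index(e_hat.argmax(), e_hat.shape)     # (row, col) of the rendered peak
```

In this toy case the peak is sharp enough that the softargmax output sits essentially on the peak pixel, so the rendered Gaussian is centred on the same location. With flatter maps the softargmax output is pulled toward the image centre, which is why a temperature-like sharpness parameter is often used in practice.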

Theorems & Definitions (8)

Theorem 4.1: Error Bound
Corollary 4.2: Error Bound vs. Encoder RF Size
Corollary 4.3: Error Bound vs. Decoder RF Size
Corollary 4.4: Error Bound vs. Object Size
Corollary 4.5: Error Bound vs. Gaussian Size
proof
Corollary B.1: Error vs. Encoder RF Size for Multiple Object Sizes
Corollary B.2: Error vs. Decoder RF Size for Multiple Object Sizes
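
As a quick numeric check, the two error terms quoted in the Figure 2 caption can be evaluated directly. The sketch below is only an illustration: the function names are hypothetical, $\Delta_G$ is treated as a plain input because its relationship to $\sigma_G$ is not given in this summary, and the two terms are reported separately rather than combined, since the full statement of Theorem 4.1 is truncated here.

```python
def encoder_error(s_psi, s_o):
    """Maximum position error due to the encoder: s_psi/2 + s_o/2 - 1 (Figure 2a)."""
    return s_psi / 2 + s_o / 2 - 1

def decoder_error(s_phi, s_o, delta_g):
    """Maximum position error due to the decoder: s_phi/2 - s_o/2 + delta_g (Figure 2b).

    delta_g is the Gaussian shift from the caption; it is an input here (assumption).
    """
    return s_phi / 2 - s_o / 2 + delta_g

# Settings quoted for Figure 4 (a)-(c): s_psi = 9, s_phi = 25, s_o = 9.
enc = encoder_error(s_psi=9, s_o=9)                  # 4.5 + 4.5 - 1 = 8.0 pixels
dec = decoder_error(s_phi=25, s_o=9, delta_g=0.0)    # 12.5 - 4.5 + 0 = 8.0 pixels
print(f"encoder term: {enc} px, decoder term (delta_g = 0): {dec} px")
```

The signs match the corollary titles above: the encoder term grows with both the encoder RF size and the object size, while the decoder term grows with the decoder RF size and shrinks as the object gets larger.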
