Table of Contents
Fetching ...

Compositional Factorization of Visual Scenes with Convolutional Sparse Coding and Resonator Networks

Christopher J. Kymn, Sonia Mazelet, Annabel Ng, Denis Kleyko, Bruno A. Olshausen

TL;DR

This work presents a two-stage framework for visual scene factorization that combines convolutional sparse coding with a resonator network within the HD/VSA paradigm. By first extracting a sparse, translation-equivariant latent representation and then encoding it into a high-dimensional vector, the method enables efficient factorization of multiple objects and their poses via a resonator network. The approach yields higher accuracy, faster convergence, and stronger confidence-based stopping compared to pixel-based encoding across datasets including Random Bars and Translated MNIST, and is shown to be scalable to multiple objects. The findings demonstrate the value of compositional, transparent representations for visual scene understanding and point to future hardware and transformation-learning extensions.

Abstract

We propose a system for visual scene analysis and recognition based on encoding the sparse, latent feature-representation of an image into a high-dimensional vector that is subsequently factorized to parse scene content. The sparse feature representation is learned from image statistics via convolutional sparse coding, while scene parsing is performed by a resonator network. The integration of sparse coding with the resonator network increases the capacity of distributed representations and reduces collisions in the combinatorial search space during factorization. We find that for this problem the resonator network is capable of fast and accurate vector factorization, and we develop a confidence-based metric that assists in tracking the convergence of the resonator network.

Compositional Factorization of Visual Scenes with Convolutional Sparse Coding and Resonator Networks

TL;DR

This work presents a two-stage framework for visual scene factorization that combines convolutional sparse coding with a resonator network within the HD/VSA paradigm. By first extracting a sparse, translation-equivariant latent representation and then encoding it into a high-dimensional vector, the method enables efficient factorization of multiple objects and their poses via a resonator network. The approach yields higher accuracy, faster convergence, and stronger confidence-based stopping compared to pixel-based encoding across datasets including Random Bars and Translated MNIST, and is shown to be scalable to multiple objects. The findings demonstrate the value of compositional, transparent representations for visual scene understanding and point to future hardware and transformation-learning extensions.

Abstract

We propose a system for visual scene analysis and recognition based on encoding the sparse, latent feature-representation of an image into a high-dimensional vector that is subsequently factorized to parse scene content. The sparse feature representation is learned from image statistics via convolutional sparse coding, while scene parsing is performed by a resonator network. The integration of sparse coding with the resonator network increases the capacity of distributed representations and reduces collisions in the combinatorial search space during factorization. We find that for this problem the resonator network is capable of fast and accurate vector factorization, and we develop a confidence-based metric that assists in tracking the convergence of the resonator network.
Paper Structure (16 sections, 8 equations, 5 figures)

This paper contains 16 sections, 8 equations, 5 figures.

Figures (5)

  • Figure 1: Results from the Random Bars dataset. a) Example objects from the Random Bars dataset. Each object consists of white horizontal and vertical bars that are placed at random positions. b) Ground truth convolutional sparse code, the two basis functions are white horizontal and vertical bars. c) Single trial accuracy of resonator network inference for a varying degree of items in the shape codebook, given a fixed dimension of $D=2500$. The sparse representation consistently outperforms the pixel encoding. d) The resonator network also converges in fewer iterations when using sparse representations. Solid lines report median number of iterations, with intervals reporting 25th and 75th percentiles. e) Sparse coding helps to improve decoding of multiple shapes; multi-object accuracy is higher and decays slower for increasing numbers of present shapes.
  • Figure 2: Results from the Translated MNIST dataset. a) An example scene composed of 4 handwritten digits from the Translated MNIST dataset. b) The learned set of basis functions $\{ \boldsymbol\phi_j \}$; each element resembles a localized stroke. c) Absolute values of the convolutional sparse code are averaged over different channels. A much lower fraction of spatial positions are active compared to the original image. d) Improvements of the sparse representation and the resonator network over the pixel encoding baseline, in terms of accuracy on images of varying sizes. e) Sparse representations also reduce the required number of iterations for the resonator network to converge. Solid lines report median number of iterations, with intervals reporting 25th and 75th percentiles. f) Improvements of sparse representations when operating on scenes with multiple digits. These results are consistent with our findings on the Random Bars dataset.
  • Figure 3: Confidence metrics, tested on the Translated MNIST dataset. a) Example evolution of confidence over the course of resonator network dynamics. Early stopping when all confidence values cross a threshold. b) Distribution of confidence values after convergence in the case when the resonator network is correct and incorrect. c) The confidence-based stopping criterion converges in fewer iterations compared to the fixed point stopping criterion (using thresholds of $0.6$ for sparse and $0.3$ for pixel). d) Using the thresholds in (c) are sufficient to achieve the same level of accuracy as with the fixed point criterion.
  • Figure 4: Results on the Letters dataset. a) Four example images (out of $26$) from the Letters dataset. b) Resonator network factorization is substantially more accurate when working with sparse code representations as opposed to the pixel encodings, for visual scenes of varying sizes. c) The convergence of the resonator network is also faster when factorizing sparse code representations. Data points report medians, and bars indicate $25$th and $75$th percentiles. These results are consistent with experiments on other datasets (see Figs. \ref{['fig:bars dataset']} and \ref{['fig:mnist_results']}).
  • Figure 5: A comparison of the resonator network's performance on the Translated MNIST dataset when taking whitening into account ($D=2{,}500$, which is lower than for results shown in Fig. \ref{['fig:mnist_results']}). a) Whitening improves the performance of both pixel encodings and sparse representations. b) A comparison of median number of iterations for both whitened and unwhitened vectors. When whitening transformations are applied, sparse representations still converge in fewer iterations compared to pixel encodings. The eventual decline in the number of iterations for low accuracy regimes indicates that, in these situations, the resonator network is often converging to local minima.