Generalized multi-object classification and tracking with sparse feature resonator networks

Lazar Supic, Alec Mullen, E. Paxon Frady

Abstract

In visual scene understanding tasks, it is essential to capture both invariant and equivariant structure. While neural networks are frequently trained to achieve invariance to transformations such as translation, this often comes at the cost of losing access to equivariant information - e.g., the precise location of an object. Moreover, invariance is not naturally guaranteed through supervised learning alone, and many architectures generalize poorly to input transformations not encountered during training. Here, we take an approach based on analysis-by-synthesis and factoring using resonator networks. A generative model describes the construction of simple scenes containing MNIST digits and their transformations, like color and position. The resonator network inverts the generative model, and provides both invariant and equivariant information about particular objects. Sparse features learned from training data act as a basis set to provide flexibility in representing variable shapes of objects, allowing the resonator network to handle previously unseen digit shapes from the test set. The modular structure provides a shape module which contains information about the object shape with translation factored out, allowing a simple classifier to operate on centered digits. The classification layer is trained solely on centered data, requiring much less training data, and the network as a whole can identify objects with arbitrary translations without data augmentation. The natural attention-like mechanism of the resonator network also allows for analysis of scenes with multiple objects, where the network dynamics selects and centers only one object at a time. Further, the specific position information of a particular object can be extracted from the translation module, and we show that the resonator can be designed to track multiple moving objects with precision of a few pixels.
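
The factoring step described in the abstract can be illustrated with a minimal resonator network sketch. This is not the authors' implementation. It assumes random bipolar codebooks, element-wise multiplication as the binding operation, and three factors standing in for shape, position, and color. The names `make_codebook`, `bipolar_sign`, and `resonator_factorize` are hypothetical, and the sparse feature basis used in the paper is not modeled here.

```python
import numpy as np

def make_codebook(rng, dim, size):
    """Random bipolar codebook: one column per candidate value of a factor."""
    return rng.choice([-1.0, 1.0], size=(dim, size))

def bipolar_sign(v):
    """Sign nonlinearity that maps zeros to +1 so the state stays bipolar."""
    return np.where(v >= 0, 1.0, -1.0)

def resonator_factorize(s, codebooks, n_iter=50):
    """Estimate which column of each codebook was bound into the vector s.

    Assumption: s is the element-wise product of one column from each
    codebook. Each module unbinds the current estimates of the other
    factors from s, then cleans up the result by projecting it onto its
    own codebook.
    """
    # Start every module in the superposition of all its candidates.
    estimates = [bipolar_sign(cb.sum(axis=1)) for cb in codebooks]
    for _ in range(n_iter):
        for i, cb in enumerate(codebooks):
            others = np.ones_like(s)
            for j, est in enumerate(estimates):
                if j != i:
                    others *= est
            # For bipolar vectors, multiplying by a vector also unbinds it.
            unbound = s * others
            # Clean-up: similarity to each candidate, then back to vector space.
            estimates[i] = bipolar_sign(cb @ (cb.T @ unbound))
    # Read out the best-matching candidate index for each factor.
    return [int(np.argmax(cb.T @ est)) for cb, est in zip(codebooks, estimates)]

# Illustrative usage: bind one item per factor, then recover the indices.
rng = np.random.default_rng(0)
codebooks = [make_codebook(rng, 2048, 8) for _ in range(3)]
scene = codebooks[0][:, 3] * codebooks[1][:, 5] * codebooks[2][:, 1]
print(resonator_factorize(scene, codebooks))
```

Each module only ever searches its own codebook, which is why the combined search space (here 8 × 8 × 8 combinations) never has to be enumerated explicitly.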

Paper Structure (11 sections, 2 equations, 2 figures, 1 table)

This paper contains 11 sections, 2 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Scene understanding with sparse feature resonator networks. A. A simple scene with an MNIST digit is presented. The task is to factorize shape, location, and color, where each factor is represented by one of the resonator network modules. B. Visualization of the resonator network dynamics. Here, three resonator networks operate in parallel, one for each row. Iteration time flows down. During early iterations the dynamics are random and chaotic. Around iterations 5-10 the network finds a solution and converges. Yellow indicates the highest output; the maximum peak is taken as the output for color and position. C. Example sparse basis functions of MNIST digits, learned from a separate training set. D. The coefficients from the shape module and the sparse basis functions are combined to reconstruct the object with position and color factored out. A classifier then predicts digit identity from the centered digit; translation/color invariance is handled by the resonator network. E. The full scene is reconstructed from the resonator network outputs.
  • Figure 2: Multi-object motion tracking with sparse feature resonator networks. A. The input to the resonator network is a video of the simple MNIST scene with each digit moving along a trajectory, visualized by the colored arrows. B. The resonator dynamics are visualized during the tracking task. During the initial iterations each network searches for one of the objects. Once an object is "locked on", the shape and color modules converge, while the position module continues updating to follow the object. The object's trajectory can be decoded from the activity peaks of the position modules. C. The converged shapes of each resonator network's shape module are visualized. D. The distance between the estimated position and the ground-truth position is calculated over time. The initial searching phase shows high errors, but once the object is locked on, the tracking error drops to 1 or 2 pixels. E. Summary of distance measurements from 100 3-object tracking experiments.
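
The peak readout mentioned in Figure 1B and the distance measure in Figure 2D can be sketched as follows. This is a hedged illustration under assumptions, not the paper's code: the position module's activity is assumed to be a 2-D map over pixel locations, the distance is assumed to be Euclidean, and `decode_peak` and `tracking_error` are hypothetical helper names.

```python
import numpy as np

def decode_peak(activity):
    """Return the (row, col) index of the largest value in a 2-D activity map."""
    row, col = np.unravel_index(np.argmax(activity), activity.shape)
    return int(row), int(col)

def tracking_error(estimated, ground_truth):
    """Per-frame Euclidean distance (in pixels) between two trajectories.

    Both arguments are sequences of (row, col) positions, one per frame.
    """
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    return np.linalg.norm(est - gt, axis=-1)

# Illustrative usage on synthetic data (not results from the paper).
activity = np.zeros((5, 7))
activity[2, 4] = 1.0
print(decode_peak(activity))
print(tracking_error([(0, 0), (3, 4)], [(0, 0), (0, 0)]))
```

Applying `decode_peak` to each frame's position activity gives an estimated trajectory, and `tracking_error` then yields the kind of per-frame distance curve the caption describes.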