Tracking objects that change in appearance with phase synchrony

Sabine Muzellec; Drew Linsley; Alekh K. Ashok; Ennio Mingolla; Girik Malik; Rufin VanRullen; Thomas Serre

TL;DR

The study tackles tracking objects whose appearance (color or shape) and position change over time. It introduces a complex-valued recurrent neural network (CV-RNN) that uses neural synchrony, carried by the phase of its units, to bind object features to their locations, so that attention to where an object is does not depend on what it looks like. On the FeatureTracker benchmark, the CV-RNN approaches human performance and often outperforms other deep networks, providing a computational proof-of-concept that phase synchronization can support tracking of appearance-morphing objects. The work suggests concrete neural mechanisms and predictions for neuroscience, and makes data and code publicly available to spur further investigation into aligning machine vision with human-like tracking strategies.

Abstract

Objects we encounter often change appearance as we interact with them. Changes in illumination (shadows), object pose, or the movement of non-rigid objects can drastically alter available image features. How do biological visual systems track objects as they change? One plausible mechanism involves attentional mechanisms for reasoning about the locations of objects independently of their appearances -- a capability that prominent neuroscience theories have associated with computing through neural synchrony. Here, we describe a novel deep learning circuit that can learn to precisely control attention to features separately from their location in the world through neural synchrony: the complex-valued recurrent neural network (CV-RNN). Next, we compare object tracking in humans, the CV-RNN, and other deep neural networks (DNNs), using FeatureTracker: a large-scale challenge that asks observers to track objects as their locations and appearances change in precisely controlled ways. While humans effortlessly solved FeatureTracker, state-of-the-art DNNs did not. In contrast, our CV-RNN behaved similarly to humans on the challenge, providing a computational proof-of-concept for the role of phase synchronization as a neural substrate for tracking appearance-morphing objects as they move about.
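The idea of tagging the target with a phase can be illustrated with a minimal sketch. This is not the CV-RNN itself: the phases below are assigned by hand rather than learned, and the names `unit`, `phase_distance`, `locate_target`, `TARGET_PHASE`, and `DISTRACTOR_PHASE` are hypothetical. It only shows that when the magnitude of a complex-valued unit carries appearance and the phase carries object identity, the target can be followed by phase even as its magnitude changes.

```python
import cmath
import math

# Hypothetical, hand-assigned phase tags (the CV-RNN learns its own).
TARGET_PHASE = 0.0
DISTRACTOR_PHASE = math.pi


def unit(magnitude, phase):
    """Complex activation: magnitude = feature strength, phase = object tag."""
    return cmath.rect(magnitude, phase)


def phase_distance(a, b):
    """Smallest absolute angular difference between two phases, in [0, pi]."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def locate_target(frame, ref_phase=TARGET_PHASE):
    """Return the position whose unit is closest in phase to ref_phase."""
    return min(
        frame,
        key=lambda pos: phase_distance(cmath.phase(frame[pos]), ref_phase),
    )


# Three toy frames mapping position -> complex activation. The target moves
# and its magnitude (appearance) changes, but its phase tag stays fixed.
frames = [
    {(0, 0): unit(0.9, TARGET_PHASE), (3, 3): unit(0.9, DISTRACTOR_PHASE)},
    {(1, 0): unit(0.4, TARGET_PHASE), (3, 2): unit(0.9, DISTRACTOR_PHASE)},
    {(2, 1): unit(0.7, TARGET_PHASE), (2, 2): unit(0.4, DISTRACTOR_PHASE)},
]

trajectory = [locate_target(f) for f in frames]
print(trajectory)
```

Selecting by magnitude alone would pick the distractor in the second frame, where it has the stronger response; selecting by phase does not depend on appearance.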

Paper Structure (56 sections, 17 equations, 27 figures, 5 tables)

Introduction
Contributions.
Background and related work
Visual routines
Computing through neural synchrony
Generalization and shortcut learning in DNNs
Complex-valued representations in artificial neural networks.
Motivation
Neural synchrony can implement visual routines for object tracking
Solving the shell game with the CV-RNN.
The FeatureTracker challenge
Overview
Design
Human benchmark
Results
...and 41 more sections

Figures (27)

Figure 1: How do biological visual systems track the object tagged by the yellow arrow? (a) Sometimes, the object's appearance makes it easy to track (Pylyshyn2006-an, Pylyshyn1988-pi). (b) Other times, when objects look similar, the target can be tracked by following its motion through the world (Lettvin1959-ha, Takemura2013-ch, Kim2014-bc, Adelson1985-ot, Frye2015-lu, linsley2021tracking). Here, we investigate a computational problem that has received far less attention: how do biological visual systems track objects when their colors, textures (c), or shapes (d) change over time? (e) We developed the FeatureTracker challenge to systematically evaluate humans and machine vision systems on this problem. In FeatureTracker, observers watch videos containing objects that change in color and/or shape over time, and have to decide if the target object, which begins in the red square (circled in white for clarity), ends up in the blue square by the end of a video. When presented with a FeatureTracker video, one possible strategy suggested by neuroscience theories is that the oscillatory activity of neural populations can keep track of different objects over time. Specifically, the target is encoded by a population of neurons that fire with a timing that differs from that of the population that responds to the distractors (astrand2020neuronal). We approximate the cycle of the oscillation with complex-valued neurons. In the CV-RNN, the phase of a complex-valued neuron represents the object encoded by this neuron. The CV-RNN thus learns to tag the target with a phase value different from the phase value of the distractors.
Figure 2: Neural synchrony helps track objects that change in appearance. (a) The shell game is designed to probe how a neural network, with the functional constraints of biological visual systems, could track objects as they change in appearance between frames one and two. Are the two images the same, or have the objects' colors and/or orientations flipped (three possible responses)? (b) We tested a simplified model of the hierarchical visual system on the task, which consisted of two layers of neurons: (i) a convolutional layer with high-resolution feature maps, followed by (ii) a spatial average pooling of neuron responses and a layer of recurrently connected neurons (mclelland2016theta). 1c/2c are object colors, 1o/2o are object orientations; the loss of spatial resolution between the layers causes these object features to interfere. The model can detect the features present in the frame (red and blue color, as well as square and diamond orientations), but fails at binding the color and orientation with the position, and hence cannot differentiate Frame 1 from Frame 2. (c, d) The same architecture can learn to solve the task with a complex-valued mechanism for neural synchrony, in which the magnitude of neurons captures object appearances, and the phase captures object locations.
Figure 3: Implementing neural synchrony through the complex-valued RNN (CV-RNN). The CV-RNN augments the InT RNN from linsley2021tracking (shown on the left) with neural synchrony attention through the use of complex-valued units (shown on the right). In the CV-RNN, $e_c$ and $z_c$ convert $e$ and $z$ to the complex domain, $\phi$ is a recurrent unit maintaining a complex representation of the input, and $\theta$ transforms $\phi$ into a spatial map of the current frame.
Figure 4: The FeatureTracker challenge is a controllable environment where the objects can evolve along three feature dimensions: position, shape, and color. The training distribution is generated from objects evolving in the upper-left quadrant of the 3D space (red cube), corresponding to half of the possible colors and shapes. The remaining test conditions contain, respectively, objects with colors sampled from the other half of the spectrum but the same shapes (upper-right quadrant -- green cube), the same colors but different shapes (lower-left quadrant -- purple cube), and unseen colors and shapes (lower-right quadrant -- blue cube). The task is to track the target located in the red marker in the first frame and to assess whether this target (shown here with a white arrow to improve visibility) or a distractor reaches the blue marker at the end of the video.
Figure 5: Human and DNN performance on FeatureTracker. (a) Humans and models are trained on videos where objects change in color and shape according to the distribution represented by the red cube. Both are then tested on videos where objects have appearances sampled from the same or different distributions. While humans are extremely accurate in each case, only the CV-RNN approaches their performance. (b) In a second experiment, we tested how humans and models perform on versions of the challenge where only the shape and position (top-right), color and position (bottom-left), or position alone (bottom-right; linsley2021tracking) of objects change over time. Model performance and 95% confidence intervals, along with the mean (dotted line) and 95% confidence interval (grey box) of human performance are plotted for each condition. Darker bars indicate DNNs that were pre-trained, whereas lighter bars are DNNs trained from scratch. S=shape, P=position, C=color.
...and 22 more figures
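
The binding failure described in the Figure 2 caption can be reproduced with a toy example. The sketch below is an illustration under simplifying assumptions, not the paper's model: features are binary, there are only two positions, the per-position phases are fixed by hand, and the names `CHANNELS`, `pool`, and `same` are hypothetical. It shows that spatial average pooling of real-valued responses gives identical outputs for two frames whose colors are swapped, whereas giving each position its own phase keeps the pooled responses distinct.

```python
import cmath
import math

CHANNELS = ["red", "blue", "square", "diamond"]


def pool(frame, phases=None):
    """Spatially average each feature channel over all positions.

    frame:  list of sets of active features, one set per position.
    phases: optional per-position phase tags; None gives the real-valued model.
    """
    n = len(frame)
    pooled = []
    for ch in CHANNELS:
        total = 0j
        for pos, feats in enumerate(frame):
            if ch in feats:
                phase = 0.0 if phases is None else phases[pos]
                # Magnitude 1 marks the feature; phase marks the position.
                total += cmath.rect(1.0, phase)
        pooled.append(total / n)
    return pooled


def same(a, b, tol=1e-9):
    """True if two pooled response vectors match channel by channel."""
    return all(abs(x - y) < tol for x, y in zip(a, b))


frame1 = [{"red", "square"}, {"blue", "diamond"}]
frame2 = [{"blue", "square"}, {"red", "diamond"}]  # colors flipped

phases = [0.0, math.pi / 2]  # hand-picked phase per position (an assumption)

real_indistinguishable = same(pool(frame1), pool(frame2))
complex_indistinguishable = same(pool(frame1, phases), pool(frame2, phases))
print(real_indistinguishable, complex_indistinguishable)
```

In the complex-valued case, features that belong to the same object share a phase after pooling (red and square in the first frame), which is what lets a downstream layer recover which color goes with which orientation.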
