Visual Odometry with Neuromorphic Resonator Networks

Alpha Renner, Lazar Supic, Andreea Danielescu, Giacomo Indiveri, E. Paxon Frady, Friedrich T. Sommer, Yulia Sandamirskaya

TL;DR

This work tackles energy-efficient visual odometry for mobile robots by leveraging neuromorphic hardware and Vector Symbolic Architectures (VSA). It introduces a hierarchical resonator network (HRN) operating on Fourier Holographic Reduced Representations with fractional power encoding to perform image-to-map registration and update an allocentric map for 2D motion. The approach demonstrates competitive or state-of-the-art performance on event-based VO benchmarks and shows robustness in robotic-arm and dynamic-scene experiments, with optional IMU fusion further improving accuracy. The results point to a path toward low-power, low-latency VO suitable for neuromorphic chips, enabling efficient navigation in drones, AR glasses, and planetary rovers.
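To make the encoding concrete, the following is a minimal sketch of fractional power encoding in a Fourier Holographic Reduced Representation, under stated assumptions. It is not the authors' implementation. The dimension `D`, the random base phases, the toy pixel sets, and all function names are hypothetical choices made for illustration. It shows that translating an encoded image corresponds to binding it with the encoded shift, and that a shift can be recovered by comparing the view against shifted copies of the map. The brute-force search over candidate shifts stands in for the resonator network.

```python
import cmath
import math
import random

D = 1024                      # vector dimension (illustrative choice)
rng = random.Random(0)        # fixed seed so the sketch is reproducible

# One random base phase per component and axis (hypothetical FPE seed vectors).
phi_x = [rng.uniform(-math.pi, math.pi) for _ in range(D)]
phi_y = [rng.uniform(-math.pi, math.pi) for _ in range(D)]


def encode_location(x, y):
    """Fractional power encoding: component d is exp(i * (phi_x[d]*x + phi_y[d]*y))."""
    return [cmath.exp(1j * (px * x + py * y)) for px, py in zip(phi_x, phi_y)]


def bind(a, b):
    """FHRR binding: component-wise complex multiplication."""
    return [u * v for u, v in zip(a, b)]


def encode_image(pixels):
    """Superpose the location vectors of all active pixels, weighted by intensity."""
    out = [0j] * D
    for (x, y), weight in pixels.items():
        out = [o + weight * z for o, z in zip(out, encode_location(x, y))]
    return out


def similarity(a, b):
    """Real part of the inner product of a and b, divided by the dimension."""
    return sum((u * v.conjugate()).real for u, v in zip(a, b)) / D


# A toy "map" with three active pixels and a camera view shifted by (2, 1).
map_pixels = {(1, 2): 1.0, (3, 0): 1.0, (4, 4): 1.0}
true_shift = (2, 1)
view_pixels = {(x + true_shift[0], y + true_shift[1]): w
               for (x, y), w in map_pixels.items()}

map_vec = encode_image(map_pixels)
view_vec = encode_image(view_pixels)

# Translating the map is the same as binding it with the encoded shift.
shifted_map = bind(map_vec, encode_location(*true_shift))
max_err = max(abs(u - v) for u, v in zip(shifted_map, view_vec))

# Brute-force registration: pick the shift whose bound map best matches the view.
candidates = [(dx, dy) for dx in range(4) for dy in range(4)]
best_shift = max(
    candidates,
    key=lambda s: similarity(view_vec, bind(map_vec, encode_location(*s))),
)
```

Here `max_err` measures how closely the bound map matches the directly encoded view, and `best_shift` is the candidate with the highest similarity. An exhaustive search like this grows with the number of candidate transforms, which is the cost a resonator network is meant to avoid.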

Abstract

Visual Odometry (VO) is a method to estimate self-motion of a mobile robot using visual sensors. Unlike odometry based on integrating differential measurements that can accumulate errors, such as those from inertial sensors or wheel encoders, visual odometry is not compromised by drift. However, image-based VO is computationally demanding, limiting its application in use cases with low-latency, -memory, and -energy requirements. Neuromorphic hardware offers low-power solutions to many vision and AI problems, but such solutions are complicated to design and often have to be assembled from scratch. Here we propose to use Vector Symbolic Architecture (VSA) as an abstraction layer to design algorithms compatible with neuromorphic hardware. Building from a VSA model for scene analysis, described in our companion paper, we present a modular neuromorphic algorithm that achieves state-of-the-art performance on two-dimensional VO tasks. Specifically, the proposed algorithm stores and updates a working memory of the presented visual environment. Based on this working memory, a resonator network estimates the changing location and orientation of the camera. We experimentally validate the neuromorphic VSA-based approach to VO with two benchmarks: one based on an event camera dataset and the other in a dynamic scene with a robotic task.
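As a rough illustration of how a resonator network factorizes a bound vector, the sketch below recovers two factors (for example, a horizontal and a vertical position) from their product. It is a basic two-factor resonator under assumed settings, not the hierarchical network of the paper: there is no rotation partition, no map update, and no spiking implementation. The codebook size `N`, the dimension `D`, the initialization, and all function names are assumptions made for this example.

```python
import cmath
import math
import random

D = 1024          # vector dimension (illustrative)
N = 8             # number of candidate positions per axis (illustrative)
rng = random.Random(1)


def fpe_codebook(n):
    """Codebook of n phasor vectors: entry k is a random base vector to the power k."""
    phases = [rng.uniform(-math.pi, math.pi) for _ in range(D)]
    return [[cmath.exp(1j * p * k) for p in phases] for k in range(n)]


def normalize(v):
    """Project every component back onto the unit circle (phasor nonlinearity)."""
    return [z / abs(z) if abs(z) > 1e-12 else 1 + 0j for z in v]


def inner(a, b):
    """Complex inner product: sum of conj(a) * b."""
    return sum(u.conjugate() * v for u, v in zip(a, b))


def cleanup(codebook, v):
    """Project v onto the span of the codebook and renormalize to phasors."""
    weights = [inner(c, v) for c in codebook]
    mixed = [sum(w * c[d] for w, c in zip(weights, codebook)) for d in range(D)]
    return normalize(mixed)


def unbind(s, factor):
    """Remove a factor from a bound vector by multiplying with its conjugate."""
    return [u * f.conjugate() for u, f in zip(s, factor)]


X, Y = fpe_codebook(N), fpe_codebook(N)
true_i, true_j = 5, 2
s = [a * b for a, b in zip(X[true_i], Y[true_j])]   # bound input to factorize

# Start from the superposition of all candidates for each factor.
x_hat = normalize([sum(c[d] for c in X) for d in range(D)])
y_hat = normalize([sum(c[d] for c in Y) for d in range(D)])

# Alternate: explain away the current estimate of one factor, clean up the other.
for _ in range(10):
    x_hat = cleanup(X, unbind(s, y_hat))
    y_hat = cleanup(Y, unbind(s, x_hat))

# Read out the codebook entry most similar to each converged state.
est_i = max(range(N), key=lambda k: abs(inner(X[k], x_hat)))
est_j = max(range(N), key=lambda k: abs(inner(Y[k], y_hat)))
```

Each iteration searches over the `N` candidates of one factor at a time instead of over all `N * N` combinations. The number of iterations (10) is an arbitrary choice for this toy problem.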

Paper Structure (3 sections, 7 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Neuromorphic Event-Based Visual Odometry with the Hierarchical Resonator Network. Events from an event-based camera are collected and encoded using fractional power encoding (FPE) into a VSA vector. The generative model assumes the input is generated from a rotated and translated map. This model is inverted by the hierarchical resonator using two interacting partitions with different frames of reference. The translation ($\hat{x},\hat{y}$) and rotation ($\hat{r}$) (and optionally, scale) estimates are then used to transform the input into map coordinates and to update the map.
  • Figure 2: Tracking of the camera rotation from the event-based shapes_rotation dataset (Mueggler et al., 2017) in simulation. A. Unprocessed readout (inner product similarity) of the resonator states at the beginning of the experiment (left) and in later iterations around second 37.5 of the dataset (right). The r index directly corresponds to the roll angle, while v and h are rotated and scaled (calibrated) into pan and tilt angles. Brighter colors indicate a higher inner product similarity of the resonator state with the codebook vector at the given location. After a short orientation phase where several locations are active in superposition, the states converge to a unique solution and follow the camera's movement after less than ten iterations. In later iterations (right), we observe that the similarity peak also becomes broader (i.e., the network is less certain) when the camera moves quickly outside of the area covered by the map, leading to increased error (as seen in B and D). B. Population vector readout (blue) of the angles from the HRN (Net) calibrated to the ground truth (gt) coordinates. Ground truth from motion capture (orange). Camera trajectory from IMU measurement (green). Lowest row: plot of the tracking error (angle between ground truth and output trajectory) over time. C. Map at the end of the experiment (top) and conventional camera frame that shows the scene in its entirety (bottom). D. Event-based camera input (green) and transformed map (red) at four different iterations. Overlapping pixels between the map and camera view are yellow. The map was rotated and shifted by the current estimate of the network in order to show correspondence. White dotted lines indicate the borders of the camera image. The area outside of the white lines is zero-padded, as the transformed map can extend beyond the camera image.
  • Figure 3: Tracking of the camera rotation from the event-based shapes_rotation dataset (Mueggler et al., 2017) in simulation using both IMU and event-based vision sensors. A. Unprocessed readout of the resonator states around second 37.5 of the dataset for the network including IMU fusion; compare with Fig. 2A. B. Comparison of the trajectories and error with and without fusion. The network that uses both VO and IMU performs better in cases where VO is difficult.
  • Figure 4: Tracking of the location and rotation of an event-based camera mounted on a robotic arm. A. The robotic arm setup and the tabletop scene with the event-based camera mounted on the arm. The arm moves back and forth in the arc shown in red. B. Population vector readout (blue) of the angles transformed to the ground truth coordinates. Ground truth from the robotic arm (orange). Lowest row: tracking error over time for location (left axis) and rotation (right axis). C. Population vector readout of the roll angle (color) and x, y locations transformed into the ground truth coordinate system. D. Tracking and mapping are robust against small changes in the scene. The map at the iteration before removal of the bowl, just after removal, and seconds after removal. As soon as the bowl is removed, it fades from the map until it is fully deleted. E. Same as B, but without map learning. When the arm moves out of the range of the initial map (near 2 s, 8 s, and 14 s), the tracking no longer works. However, the network can recover when the map comes back into view. The trajectory is aligned the same way as in B.
  • Figure 5: Hierarchical resonator for visual odometry. Colors match Fig. 1. The dynamics are explained in the resonator dynamics equation (eq:resonator_dynamics).
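
The captions of Figures 2 and 4 refer to a population vector readout of the resonator states. One common way to compute such a readout for an angular variable is a similarity-weighted circular mean over the candidate angles, sketched below. The exact readout used in the paper is not specified in this summary, so the function name, the clipping of negative similarities, and the example values are assumptions made for illustration.

```python
import math


def population_vector_angle(angles, similarities):
    """Similarity-weighted circular mean of candidate angles, in radians."""
    # Assumption: negative similarities carry no evidence and are clipped to zero.
    weights = [max(s, 0.0) for s in similarities]
    sin_sum = sum(w * math.sin(a) for w, a in zip(weights, angles))
    cos_sum = sum(w * math.cos(a) for w, a in zip(weights, angles))
    if sin_sum == 0.0 and cos_sum == 0.0:
        raise ValueError("no active candidates: angle is undefined")
    return math.atan2(sin_sum, cos_sum)


# Two equally active neighbours at 90 and 180 degrees; the negative entry is ignored.
candidates = [math.radians(a) for a in (0, 90, 180, 270)]
estimate = population_vector_angle(candidates, [0.0, 1.0, 1.0, -0.2])

# Candidates on both sides of the wrap-around point average to zero, not 180 degrees.
wrapped = population_vector_angle([math.radians(350), math.radians(10)], [1.0, 1.0])
```

Averaging on the circle rather than on raw angle values keeps the estimate sensible when the similarity peak straddles the wrap-around point, and it yields sub-bin resolution when neighbouring candidates are both active.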