
FlexLoc: Conditional Neural Networks for Zero-Shot Sensor Perspective Invariance in Object Localization with Distributed Multimodal Sensors

Jason Wu, Ziqi Wang, Xiaomin Ouyang, Ho Lyun Jeong, Colin Samplawski, Lance Kaplan, Benjamin Marlin, Mani Srivastava

TL;DR

FlexLoc uses conditional neural networks to inject node perspective information into the localization pipeline, enabling accurate generalization to unseen sensor perspectives with minimal additional overhead.

Abstract

Localization is a critical technology for applications ranging from navigation and surveillance to assisted living. Localization systems typically fuse information from sensors that view the scene from different perspectives to estimate the target location, and they often employ multiple modalities for enhanced robustness and accuracy. Recently, such systems have adopted end-to-end deep neural models trained on large datasets because of their superior performance and their ability to handle data from diverse sensor modalities. However, these models are often trained on data collected from a particular set of sensor poses (i.e., locations and orientations). During real-world deployments, slight deviations from these sensor poses can result in extreme inaccuracies. To address this challenge, we introduce FlexLoc, which employs conditional neural networks to inject node perspective information to adapt the localization pipeline. Specifically, a small subset of model weights is derived from node poses at run time, enabling accurate generalization to unseen perspectives with minimal additional overhead. Our evaluations on a multimodal, multiview indoor tracking dataset show that FlexLoc improves localization accuracy by almost 50% in the zero-shot case (no calibration data available) compared to the baselines. The source code of FlexLoc is available at https://github.com/nesl/FlexLoc.
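
To make the idea of deriving weights from node poses at run time concrete, the following is a minimal sketch of a pose-conditioned 1D convolution in plain Python. It is not the FlexLoc implementation: the pose dimension, the kernel size, the single linear map from pose to kernel, and all names (`make_linear`, `kernel_from_pose`, `conv1d_same`) are assumptions chosen for illustration, and the random weights stand in for trained ones.

```python
import math
import random

def make_linear(in_dim, out_dim, rng):
    """Random weights and zero biases for one linear layer (stand-in for trained weights)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases

def linear(x, layer):
    """Apply a linear layer to a vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def kernel_from_pose(pose, layer):
    """Derive the 1D convolution kernel taps from the sensor pose (hypothetical mapping)."""
    return linear(pose, layer)

def conv1d_same(features, kernel):
    """Zero-padded 1D cross-correlation that keeps the input length (odd kernel size)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(features) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(features))]

rng = random.Random(0)
POSE_DIM, KERNEL_SIZE = 6, 3  # assumed sizes, e.g. 3D location plus 3D orientation
pose_to_kernel = make_linear(POSE_DIM, KERNEL_SIZE, rng)

features = [0.5, -1.0, 2.0, 0.0, 1.5]        # features extracted from one sensor node
pose_a = [1.0, 0.0, 2.0, 0.0, 0.5, 0.0]      # pose seen during training (made up)
pose_b = [-1.0, 3.0, 2.0, 0.0, -0.5, 0.0]    # unseen pose at deployment (made up)

# The same features are transformed differently depending on the node pose.
out_a = conv1d_same(features, kernel_from_pose(pose_a, pose_to_kernel))
out_b = conv1d_same(features, kernel_from_pose(pose_b, pose_to_kernel))
```

Only the small pose-to-kernel mapping depends on the pose in this sketch, which is consistent with the abstract's point that just a small subset of weights is generated at run time.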

Paper Structure

This paper contains 22 sections, 2 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Performance degradation due to sensor perspective shift.
  • Figure 2: Illustration of our key idea to enable perspective invariant object localization. Unlike existing approaches that are unaware of sensor perspectives, we inject node pose information into the network through conditional neural networks.
  • Figure 3: Examples of multi-modal sensor data collected by three nodes.
  • Figure 4: Complete FlexLoc architecture containing Conditional Convolution and Conditional Layer Normalization. Conditional Convolution derives its 1D convolutional kernel weights from the sensor pose and transforms the extracted features of the sensor data. Conditional Layer Normalization is a more lightweight design integrated into the backbones, where we replace the learnable parameters $\gamma$ and $\beta$ with values derived from sensor pose.
  • Figure 5: Implementation and Integration of Conditional 1D Convolution.
  • ...and 4 more figures
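
The Conditional Layer Normalization described in the Figure 4 caption can be sketched as follows, assuming the pose is mapped to $\gamma$ and $\beta$ by one linear layer each. This is an illustrative sketch rather than the paper's code: the layer sizes, the random stand-in weights, the choice to predict $\gamma$ as an offset from 1, and the function names are all assumptions.

```python
import math
import random

def make_linear(in_dim, out_dim, rng):
    """Random weights and zero biases for one linear layer (stand-in for trained weights)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases

def linear(x, layer):
    """Apply a linear layer to a vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def conditional_layer_norm(x, pose, gamma_layer, beta_layer, eps=1e-5):
    """Layer normalization whose scale and shift come from the sensor pose."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normalized = [(v - mean) / math.sqrt(var + eps) for v in x]
    # Assumption: gamma is predicted as an offset from 1, so a zero pose
    # (with zero biases) reduces to plain layer normalization.
    gamma = [1.0 + g for g in linear(pose, gamma_layer)]
    beta = linear(pose, beta_layer)
    return [g * n + b for g, n, b in zip(gamma, normalized, beta)]

rng = random.Random(0)
POSE_DIM, FEATURE_DIM = 6, 4  # assumed sizes
gamma_layer = make_linear(POSE_DIM, FEATURE_DIM, rng)
beta_layer = make_linear(POSE_DIM, FEATURE_DIM, rng)

x = [0.2, -1.3, 0.7, 2.4]                    # one token's features inside a backbone
pose = [1.0, 0.0, 2.0, 0.0, 0.5, 0.0]        # made-up sensor pose
zero_pose = [0.0] * POSE_DIM

conditioned = conditional_layer_norm(x, pose, gamma_layer, beta_layer)
plain = conditional_layer_norm(x, zero_pose, gamma_layer, beta_layer)
```

Compared with generating a convolution kernel, this variant only produces one scale and one shift per feature, which matches the caption's description of it as the more lightweight design.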