EchoPT: A Pretrained Transformer Architecture that Predicts 2D In-Air Sonar Images for Mobile Robotics

Jan Steckel, Wouter Jansen, Nico Huebel

TL;DR

EchoPT addresses robust sonar-based perception for mobile robots by using a history of $n$ previous frames together with ego-motion to forecast the next sonar frame with a transformer trained in a self-supervised fashion. The approach uses a patch-embedded transformer with parallel CNN and MLP streams, achieving state-of-the-art predictive accuracy in one-shot and autoregressive modes on simulated data. Demonstrations in wheel-slip detection and high-noise corridor control show that predictive processing can sustain navigation when sensor data is degraded. The work motivates extensions to 3D sonar, spherical data representations, and real-world validation.

Abstract

The predictive brain hypothesis suggests that perception can be interpreted as the process of minimizing the error between predicted perception tokens generated by an internal world model and actual sensory input tokens. When implementing working examples of this hypothesis in the context of in-air sonar, significant difficulties arise due to the sparse nature of the reflection model that governs ultrasonic sensing. Despite these challenges, creating consistent world models using sonar data is crucial for implementing predictive processing of ultrasound data in robotics. In an effort to enable robust robot behavior using ultrasound as the sole exteroceptive sensor modality, this paper introduces EchoPT, a pretrained transformer architecture designed to predict 2D sonar images from previous sensory data and robot ego-motion information. We detail the transformer architecture that drives EchoPT and compare the performance of our model to several state-of-the-art techniques. In addition to presenting and evaluating our EchoPT model, we demonstrate the effectiveness of this predictive perception approach in two robotic tasks.

Paper Structure (20 sections, 2 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Overview of the experimental setup. Panel a) shows the simulation environment in which a two-wheeled robot drives. A sketch of the robot is shown in panel c). The robot uses an array-based imaging sonar sensor (panel g) capable of generating range-direction energy maps (called energyscapes), shown in panels d)-f). This sensor is modeled in the simulation environment based on accurate models of acoustic propagation and reflection. Panel b) shows the acoustic flow model, which predicts how objects in the sensor scene move through the perceptive field for a given robot motion. The blue flow lines are shown for a linear robot motion. Panels d)-f) illustrate the task addressed in this paper: synthesizing novel sensor views given a set of robot velocity commands $v_{lin}$ and $\omega_{r}$. Panel d) shows the prediction based on naive shifting of the image in the range and direction dimensions. Panel e) shows the prediction using the acoustic flow model of panel b). Both of these operators can use only the last frame for the prediction. Panel f) shows the EchoPT model, which takes in $n$ previous frames and velocity commands and predicts the novel view using a transformer neural network.
  • Figure 2: Overview of the network architecture of EchoPT. The EchoPT model has two inputs: the set of $n$ previous input frames (set to 3 in this paper) and the $n+1$ velocity commands (three previous and one for the prediction). The model has three main parallel branches: a transformer branch, a feed-forward convolutional branch for the sonar images, and an MLP pipeline using the velocity commands as input. These three branches are depth-concatenated and passed through more feed-forward convolutional layers to obtain a single output image.
  • Figure 3: Condensed version of the full corridor-prediction figure in the appendix. Panel a) shows the target sonar image, and panel b) shows the predicted image. Panel c) shows the difference between the two images, and panels d) and e) show the 2D correlogram.
  • Figure 4: Prediction results of a single frame using three prediction methods: the naive operation, which shifts the image in the range and direction dimensions; the acoustic flow approach, which uses the acoustic flow equations to transform the image; and the EchoPT prediction.
  • Figure 5: A first application of predictive processing, in which a robot drives a trajectory in the environment from Figure 1. In two periods (between 10 s and 16 s, and between 30 s and 36 s), the robot encounters slip conditions, meaning that it does not perform the motion it expects to perform. In the first period, both wheels slip; in the second, only one wheel slips. The plots show the slip detector, which uses differences between the predicted and measured sensor data for different prediction horizons (one-shot, 3-frame autoregressive, and 5-frame autoregressive). Longer time horizons give the clearest slip detection signal, with EchoPT being the only method that detects the second slip condition.
  • ...and 5 more figures
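
The captions of Figures 1, 4, and 5 describe a naive shift predictor, autoregressive prediction over several frames, and a slip detector driven by the mismatch between predicted and measured frames. The sketch below illustrates how these pieces could fit together. It is not the authors' implementation: the names `naive_shift`, `rollout`, `prediction_error`, and `slip_detected` are hypothetical, the motion is assumed to be already converted into whole-bin shifts, and the mean squared error with a fixed threshold is an assumed stand-in for whatever difference measure the paper uses. Only the naive baseline is shown; EchoPT or the acoustic flow model would take the place of `predict`.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[List[float]]  # rows: range bins, columns: direction bins
Shift = Tuple[int, int]    # assumed (range bins, direction bins) per time step


def naive_shift(frame: Frame, shift: Shift) -> Frame:
    """Shift a frame by whole bins; cells shifted in from outside are zero."""
    d_range, d_dir = shift
    n_rows, n_cols = len(frame), len(frame[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            src_r, src_c = r - d_range, c - d_dir
            if 0 <= src_r < n_rows and 0 <= src_c < n_cols:
                out[r][c] = frame[src_r][src_c]
    return out


def rollout(predict: Callable[[Frame, Shift], Frame],
            frame: Frame, shifts: Sequence[Shift]) -> Frame:
    """Autoregressive prediction: feed each predicted frame back as input."""
    for shift in shifts:
        frame = predict(frame, shift)
    return frame


def prediction_error(predicted: Frame, measured: Frame) -> float:
    """Mean squared difference between two frames (assumed error measure)."""
    n_cells = len(predicted) * len(predicted[0])
    total = sum((p - m) ** 2
                for row_p, row_m in zip(predicted, measured)
                for p, m in zip(row_p, row_m))
    return total / n_cells


def slip_detected(predicted: Frame, measured: Frame, threshold: float) -> bool:
    """Flag slip when the measured frame deviates from the prediction."""
    return prediction_error(predicted, measured) > threshold


# Toy frame: 4 range bins x 3 direction bins with one reflector.
last_frame = [[0.0] * 3 for _ in range(4)]
last_frame[2][1] = 1.0

commanded = (-1, 0)                        # robot expects to approach by one bin
predicted = naive_shift(last_frame, commanded)

measured_no_slip = naive_shift(last_frame, commanded)  # motion happened
measured_slip = last_frame                             # wheels slipped, scene unchanged

print(slip_detected(predicted, measured_no_slip, threshold=0.05))  # False
print(slip_detected(predicted, measured_slip, threshold=0.05))     # True
```

In this toy setting, passing a longer list of commanded shifts to `rollout` moves the predicted reflector further from where it stays when the wheels slip. This gives some intuition for why longer prediction horizons can yield a clearer slip signal, although the real energyscapes and models are far richer than this example.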