CoViS-Net: A Cooperative Visual Spatial Foundation Model for Multi-Robot Applications

Jan Blumenkamp, Steven Morad, Jennifer Gielis, Amanda Prorok

TL;DR

This work proposes CoViS-Net, a decentralized visual spatial foundation model that learns spatial priors from data, enabling pose estimation as well as spatial comprehension, and demonstrates its use in a multi-robot formation control task across various real-world settings.

Abstract

Autonomous robot operation in unstructured environments is often underpinned by spatial understanding through vision. Systems composed of multiple concurrently operating robots additionally require access to frequent, accurate and reliable pose estimates. In this work, we propose CoViS-Net, a decentralized visual spatial foundation model that learns spatial priors from data, enabling pose estimation as well as spatial comprehension. Our model is fully decentralized, platform-agnostic, executable in real-time using onboard compute, and does not require existing networking infrastructure. CoViS-Net provides relative pose estimates and a local bird's-eye-view (BEV) representation, even without camera overlap between robots (in contrast to classical methods). We demonstrate its use in a multi-robot formation control task across various real-world settings. We provide code, models and supplementary material online at https://proroklab.github.io/CoViS-Net/

Paper Structure (22 sections, 7 equations, 20 figures, 8 tables)

Figures (20)

  • Figure 1: Our model can be used to control the relative pose of multiple follower robots (depicted in yellow and magenta) to a leader robot (blue) following a reference trajectory using visual cues in the environment only (the field-of-view is depicted as colored area). Even if there is no direct visual overlap (intersection of colored areas), the model is able to estimate the relative pose.
  • Figure 2: By leveraging spatial priors, our model provides rough relative positions even without overlapping fields of view, which classical approaches cannot achieve. This figure shows a sample from the simulation validation dataset $\mathcal{D}_{\mathrm{Val}}^{\mathrm{Sim}}$ for five nodes, each represented by a column. The top row displays the respective camera image $I_{i}$. The middle row shows the ground truth labels in the coordinate system $\mathcal{F}_i$, with each robot centered and facing upwards at (0, 0), and the ground truth poses $\mathbf{p}_{i, ij}$ of other robots, along with their field of view. The background shows the BEV representation $\mathrm{BEV}_{i}$ for each robot. The bottom row presents the corresponding predictions, displaying pose predictions $\hat{\mathbf{p}}_{i, ij}, \hat{\mathbf{R}}_{i, ij}$ with low uncertainty $\sigma^2_{\mathrm{p}, ij}, \sigma^2_{\mathrm{R}, ij}$, predicted uncertainties $\sigma^2_{\mathrm{p}, ij}$ as a dashed oval and $\sigma^2_{\mathrm{R}, ij}$ similar to the FOV, and the predicted BEV representation $\hat{\mathrm{BEV}}_{i}$ in the background.
  • Figure 3: Overview of the model architecture: We illustrate the decentralized evaluation with respect to ego-robot $v_i$, which receives embeddings $\mathbf{E}_{j}, \mathbf{E}_{k}$ generated by the encoder $f_{\textrm{enc}}$ from the neighbors $v_j, v_k \in \mathcal{N}(v_i)$. For each received embedding, we employ $f_{\textrm{pose}}$ to compute the pose. We concatenate the pose embedding $\mathbf{E}_{ij}$ with the image embedding $\mathbf{E}_{j}$ (and $\mathbf{E}_{ik}$ with $\mathbf{E}_{k}$) and subsequently aggregate this into a node feature $\mathbf{F}_{i}$. Model outputs are framed with a dashed line.
  • Figure 4: We evaluate the tracking performance of our model and uncertainty-aware controller on reference trajectories with two follower robots (blue and orange) positioned to the left and right of the leader robot (green). We report tracking performance over time for position and rotation, as well as the predicted uncertainties. The leader always faces the direction of movement. We show the trajectory for 120 s and the tracking error for the first 60 s.
  • Figure 5: We show snapshots of real-world deployments in different scenes with up to four robots. Each row contains six frames from a video recording, spaced 1 s apart. We indicate the positions of the robots in the first frame, where the leader is circled yellow and the followers blue. The first three samples show indoor scenes, and the bottom sample is an outdoor deployment. The second sample shows a heterogeneous deployment, combining robots of different sizes and dynamics.
  • ...and 15 more figures
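
The Figure 2 caption describes ground truth labels expressed in each robot's own coordinate system $\mathcal{F}_i$, with the ego robot at (0, 0). As a minimal sketch of what such an ego-frame label could look like, the snippet below converts a neighbor's pose from a shared world frame into the frame of robot $i$. It is a planar (x, y, yaw) simplification and not the authors' code: the function name `relative_pose_2d`, the pose layout, and the choice of the body x-axis as "forward" are all assumptions made for illustration.

```python
import numpy as np

def relative_pose_2d(pose_i, pose_j):
    """Express robot j's planar pose in robot i's frame (hypothetical helper).

    Each pose is (x, y, yaw) in a shared world frame, yaw in radians.
    Returns the position of j in frame i and the relative heading.
    """
    xi, yi, yaw_i = pose_i
    xj, yj, yaw_j = pose_j

    # Transpose of robot i's rotation matrix maps world offsets into frame i.
    c, s = np.cos(yaw_i), np.sin(yaw_i)
    rot_world_to_i = np.array([[c, s],
                               [-s, c]])
    p_ij = rot_world_to_i @ np.array([xj - xi, yj - yi])

    # Wrap the heading difference into [-pi, pi).
    yaw_ij = (yaw_j - yaw_i + np.pi) % (2.0 * np.pi) - np.pi
    return p_ij, yaw_ij

# Robot i faces the world +y direction; robot j is 2 m further along +y.
p_ij, yaw_ij = relative_pose_2d((1.0, 1.0, np.pi / 2), (1.0, 3.0, np.pi / 2))
print(p_ij, yaw_ij)
```

In this example robot j ends up directly ahead of robot i along its forward axis with no relative rotation, which matches the convention of drawing each ego robot centered and facing a fixed direction.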