Foundation Models for Amodal Video Instance Segmentation in Automated Driving

Jasmin Breitenstein, Franz Jünger, Andreas Bär, Tim Fingscheidt

TL;DR

This work exploits the extensive knowledge of the Segment Anything Model (SAM), while fine-tuning it to the amodal instance segmentation task, and achieves state-of-the-art results in amodal video instance segmentation while resolving the need for amodal video-based labels.

Abstract

In this work, we study amodal video instance segmentation for automated driving. Previous works perform amodal video instance segmentation relying on methods trained on entirely labeled video data with techniques borrowed from standard video instance segmentation. Such amodally labeled video data is difficult and expensive to obtain and the resulting methods suffer from a trade-off between instance segmentation and tracking performance. To largely solve this issue, we propose to study the application of foundation models for this task. More precisely, we exploit the extensive knowledge of the Segment Anything Model (SAM), while fine-tuning it to the amodal instance segmentation task. Given an initial video instance segmentation, we sample points from the visible masks to prompt our amodal SAM. We use a point memory to store those points. If a previously observed instance is not predicted in a following frame, we retrieve its most recent points from the point memory and use a point tracking method to follow those points to the current frame, together with the corresponding last amodal instance mask. This way, while basing our method on an amodal instance segmentation, we nevertheless obtain video-level amodal instance segmentation results. Our resulting S-AModal method achieves state-of-the-art results in amodal video instance segmentation while resolving the need for amodal video-based labels. Code for S-AModal is available at https://github.com/ifnspaml/S-AModal.
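
The per-frame bookkeeping described in the abstract (sample points from each visible mask, store them in a point memory, and detect instances that were seen before but are missing now) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `sample_points`, `PointMemory`, and `process_frame` are hypothetical, masks are represented as plain sets of `(h, w)` pixels, and uniform random sampling is an assumption, since the abstract does not state how points are drawn. The amodal SAM call and the point tracker are deliberately left out.

```python
import random
from dataclasses import dataclass, field

Point = tuple[int, int]  # (h, w) pixel coordinate


def sample_points(mask: set[Point], k: int, rng: random.Random) -> list[Point]:
    """Sample up to k prompt points from a visible mask (uniform sampling is an assumption)."""
    pixels = sorted(mask)  # sort for reproducibility with a seeded rng
    return rng.sample(pixels, min(k, len(pixels)))


@dataclass
class PointMemory:
    """Hypothetical point memory: instance id -> (frame index, most recent prompt points)."""
    entries: dict[int, tuple[int, list[Point]]] = field(default_factory=dict)

    def store(self, instance_id: int, t: int, points: list[Point]) -> None:
        self.entries[instance_id] = (t, points)

    def missing(self, visible_ids) -> list[int]:
        """Ids observed earlier that are absent from the current frame's predictions."""
        return sorted(set(self.entries) - set(visible_ids))


def process_frame(t, visible_masks, memory, k, rng):
    """Return prompt points for visible instances and the ids that would need point tracking."""
    prompts = {}
    for n, mask in visible_masks.items():
        prompts[n] = sample_points(mask, k, rng)
        memory.store(n, t, prompts[n])  # overwrite with the most recent points
    return prompts, memory.missing(visible_masks)


rng = random.Random(0)
memory = PointMemory()
# Frame 0: instances 1 and 2 are visible. Frame 1: instance 2 is no longer predicted.
process_frame(0, {1: {(0, 0), (0, 1), (1, 0)}, 2: {(5, 5), (5, 6)}}, memory, k=2, rng=rng)
_, lost = process_frame(1, {1: {(0, 1)}}, memory, k=2, rng=rng)
print(lost)  # [2]
```

In a full pipeline, the prompt points would be passed to the fine-tuned amodal SAM, and the stored points of each id in `lost` would be handed to a point tracking method.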

Paper Structure

This paper contains 9 sections, 1 equation, 10 figures, 6 tables, and 1 algorithm.

Figures (10)

  • Figure 1: In a video sequence, when an instance is (partially) visible (time step $t-1$), we extract points (green arrow) from the predicted instance mask to prompt an amodal SAM method to generate an amodal mask (yellow). The corresponding points are stored, and if the instance is not visible (time step $t$), we track the previous points to the next frame ($t$), transferring the previous amodal mask to the next frame (yellow, purple arrow). Once the instance reappears (time step $t+1$), we prompt the amodal SAM method again (green arrow).
  • Figure 1: Detailed structure of the adapter block [Chen2023] (left) and the SAM network [Kirillov2023] used during our amodal fine-tuning (right). Snowflakes indicate layers frozen during fine-tuning, while the gear wheel indicates adjustable layers. The fine-tuned SAM network is used in our S-AModal method as the amodal SAM network $\mathbf{f}^\text{aSAM}()$, as shown in Figure 2.
  • Figure 2: Overview of our S-AModal method: Given an input frame $\mathbf{x}_t$, it predicts amodal instance segmentation masks $\left(\mathbf{a}_{t,n}\right)_{n \in \mathcal{N}_{t}}$. First, a VIS method provides visible instance masks $\left(\mathbf{m}_{t,n}\right)_{n \in \mathcal{N}_{t}^\text{vis}}$, from which we extract points $\left(\mathbf{p}_{t,n}\right)_{n \in \mathcal{N}_{t}^\text{vis}}$. These points prompt our amodal SAM method to produce amodal instance masks $\left(\mathbf{a}_{t,n}\right)_{n \in \mathcal{N}_{t}^\text{vis}}$. Points are stored so that they can be tracked through occlusions, which allows updating previous masks $\left(\mathbf{a}_{t-1,n}\right)_{n \in \mathcal{N}_{t-1}^\text{vis}}$ to $\left(\mathbf{a}_{t,n}\right)_{n \in \mathcal{N}_{t}^\text{inv}}$. The final amodal masks per frame $\left(\mathbf{a}_{t,n}\right)_{n \in \mathcal{N}_{t}}$ combine $\left(\mathbf{a}_{t,n}\right)_{n \in \mathcal{N}_{t}^\text{inv}}$ and $\left(\mathbf{a}_{t,n}\right)_{n \in \mathcal{N}_{t}^\text{vis}}$. We denote delay units by "T".
  • Figure 2: Schematic visualization of a point prompt $p_{t,n,k}$ (left, green cross) resulting in the yellow amodal mask. For clarity, we denote the point prompt by its height $h$ and width $w$ values, i.e., $(h,w)=(250,400)$ (indicated by green lines). Right: visualization of the same point prompt (green cross) in the corresponding image of $\mathcal{D}_\text{ASD}^\text{val}$, again with its height and width values $(h,w)=(250,450)$ (green lines), resulting in the yellow amodal mask.
  • Figure 3: Schematic view of a video sequence $\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{t+1}$ and an instance $n$ which is partially visible at $t-1$ and $t+1$, but fully occluded at $t$. For frames $\mathbf{x}_{t-1}$ and $\mathbf{x}_{t+1}$, the points $p_{t-1,n,k}$ and $p_{t+1,n,k}$ of the predicted visible instance masks $\mathbf{m}_{t-1,n}$ and $\mathbf{m}_{t+1,n}$ are used to prompt the amodal SAM model to obtain $\mathbf{a}_{t-1,n}$ and $\mathbf{a}_{t+1,n}$, respectively. For frame $\mathbf{x}_t$, we apply point tracking to obtain the predicted point $\hat{p}_{t,n,k}$ and shift the amodal mask along this trajectory to $\hat{\mathbf{a}}_{t,n}$.
  • ...and 5 more figures
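
The mask transfer described in the Figure 3 caption (track the stored points into the occluded frame, then shift the last amodal mask along that trajectory) might be sketched as follows. This is only an illustration under stated assumptions: the point tracker is not implemented here and its output `tracked_points` is taken as given, the shift is modeled as a pure translation by the rounded mean point displacement (the captions do not specify how several points are aggregated), and the helper names `mean_displacement` and `shift_mask` are hypothetical. Masks are again plain sets of `(h, w)` pixels.

```python
Point = tuple[int, int]  # (h, w) pixel coordinate


def mean_displacement(prev_points: list[Point], tracked_points: list[Point]) -> Point:
    """Rounded mean (dh, dw) between stored points and their tracked positions.

    Averaging over points is an assumption made for this sketch.
    """
    if not prev_points or len(prev_points) != len(tracked_points):
        raise ValueError("need equally many stored and tracked points (at least one)")
    n = len(prev_points)
    dh = sum(q[0] - p[0] for p, q in zip(prev_points, tracked_points)) / n
    dw = sum(q[1] - p[1] for p, q in zip(prev_points, tracked_points)) / n
    return round(dh), round(dw)


def shift_mask(mask: set[Point], offset: Point, height: int, width: int) -> set[Point]:
    """Translate a mask by offset and drop pixels that leave the image."""
    dh, dw = offset
    return {
        (h + dh, w + dw)
        for (h, w) in mask
        if 0 <= h + dh < height and 0 <= w + dw < width
    }


# Stored points from frame t-1 and their (given) tracked positions in frame t.
prev_points = [(2, 2), (2, 4)]
tracked_points = [(3, 5), (3, 7)]
offset = mean_displacement(prev_points, tracked_points)

# Last amodal mask from frame t-1, moved into frame t of a 10x10 image.
last_amodal_mask = {(2, 2), (2, 3), (9, 9)}
shifted = shift_mask(last_amodal_mask, offset, height=10, width=10)
print(offset)           # (1, 3)
print(sorted(shifted))  # [(3, 5), (3, 6)]
```

The pixel `(9, 9)` is dropped because its shifted position falls outside the image; whether clipping is the right behavior at image borders is a design choice of this sketch.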