
Mask-aware inference with State-Space Models

Ignasi Mas, Ramon Morros, Javier-Ruiz Hidalgo, Ivan Huerta

TL;DR

This work introduces Partial Vision Mamba (PVM), a novel architectural component that ports the principles of partial operations to the Mamba backbone and shows the efficacy and generalizability of the approach in the tasks of depth completion, image inpainting, and classification with invalid data.

Abstract

Many real-world computer vision tasks, such as depth completion, must handle inputs with arbitrarily shaped regions of missing or invalid data. For Convolutional Neural Networks (CNNs), Partial Convolutions address this with a mask-aware re-normalization conditioned only on valid pixels. More recently, State Space Models (SSMs) such as Mamba have emerged, offering high performance with linear complexity. However, these architectures lack an inherent mechanism for handling such arbitrarily shaped invalid data at inference time. To bridge this gap, we introduce Partial Vision Mamba (PVM), a novel architectural component that ports the principles of partial operations to the Mamba backbone. We also define a set of rules for designing architectures with PVM. We show the efficacy and generalizability of our approach on the tasks of depth completion, image inpainting, and classification with invalid data.
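
To make the mask-aware re-normalization mentioned above concrete, the following is a minimal sketch of a single-channel partial convolution in plain Python. It follows the commonly used formulation in which the masked response is rescaled by (window size / number of valid pixels) and the output is marked valid whenever at least one input pixel in the window is valid. It is not the PVM operator itself; the function name `partial_conv2d` and the toy inputs are illustrative assumptions.

```python
def partial_conv2d(x, mask, kernel, bias=0.0):
    """Single-channel partial convolution (stride 1, no padding).

    x, mask: H x W nested lists; mask is 1 for valid pixels, 0 for invalid.
    kernel:  k x k nested list of weights.
    Returns (output, updated_mask), both of size (H-k+1) x (W-k+1).
    Illustrative sketch only; names and layout are assumptions.
    """
    k = len(kernel)
    h, w = len(x), len(x[0])
    window_size = k * k
    out, new_mask = [], []
    for i in range(h - k + 1):
        out_row, mask_row = [], []
        for j in range(w - k + 1):
            acc = 0.0
            n_valid = 0
            for di in range(k):
                for dj in range(k):
                    m = mask[i + di][j + dj]
                    # Invalid pixels are zeroed out by the mask.
                    acc += kernel[di][dj] * x[i + di][j + dj] * m
                    n_valid += m
            if n_valid > 0:
                # Re-normalize by the fraction of valid pixels in the window.
                out_row.append(acc * window_size / n_valid + bias)
                mask_row.append(1)
            else:
                # No valid input at all: the output stays invalid.
                out_row.append(0.0)
                mask_row.append(0)
        out.append(out_row)
        new_mask.append(mask_row)
    return out, new_mask


# Toy input: constant image of 2.0 with a garbage value in an invalid hole.
x = [[2.0, 2.0, 2.0],
     [2.0, 99.0, 2.0],
     [2.0, 2.0, 2.0]]
mask = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
mean_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]

out, new_mask = partial_conv2d(x, mask, mean_kernel)
```

With a mean filter, the re-normalized response at the hole is approximately 2.0, the value of the surrounding valid pixels, and the updated mask marks that position as valid. Without the re-normalization, the zeroed-out hole would bias the response toward zero.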

Paper Structure (21 sections, 1 equation, 7 figures, 5 tables)

Figures (7)

  • Figure 1: The VM backbone (left) cannot handle invalid regions (the white hole in the image), whereas the PVM modifications (right) allow partially valid patches to be projected into valid tokens and all tokens (including invalid ones) to be processed into a valid output sequence.
  • Figure 2: PVM-DC (\ref{fig:arch:dc_overview}) estimates a dense depth map from a sparse PNCC \cite{3ddfa}, $x_{:, i, j}$, and its valid mask, $m_{i, j}$. Using PNCC has advantages over raw depth because it encodes the true geometry of the scene, as explained in \cite{pnccsr}. The SFE (\ref{fig:arch:dc_sfe}) and DFE (\ref{fig:arch:dc_dfe}) extract shallow and deep partially valid features, respectively, which are then merged. The filling layer (\ref{fig:arch:dc_fill_it}) then iteratively turns the resulting partially valid feature maps into completely valid ones for the final depth completion task. This architecture is inspired by \cite{pnccsr}; it modifies the final head and also replaces every module with its mask-aware equivalent, including replacing the original Vision Mamba Modules (VMM) with Partial Vision Mamba Modules (PVMM).
  • Figure 3: PVM-UNet-N: beyond the mask-aware Partial Patch embedding, the PVSS blocks work in a residual flow by aggregating dimensionality-reduced features through a Partial Average Pool 2D. The decoder (right branch) is based on VSS blocks and an Inpainting head.
  • Figure 4: PVM-Cls begins with a Partial Patch embedding followed by a residual PVM block. A Partial Average pooling is then applied to produce valid tokens that feed the Classification head.
  • Figure 5: KITTI-3D Depth Completion results. (a) RGB image, (b) LiDAR input, (c) LiDAR target, (d) VM result, (e) PVM result, (f) Error map VM, (g) Error map PVM. The detailed silhouette of the human illustrates the higher accuracy of PVM-DC over VM-DC, as shown in the error maps.
  • ...and 2 more figures
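
The captions for Figures 3 and 4 refer to a Partial Average Pool that turns partially valid features into valid tokens. The sketch below shows one plausible reading of such an operator: each window is averaged over its valid entries only, and the pooled position is valid when the window contains at least one valid entry. The function name `partial_avg_pool2d`, the non-overlapping window layout, and the choice to emit 0.0 for fully invalid windows are assumptions for illustration, not the paper's definition.

```python
def partial_avg_pool2d(x, mask, k=2):
    """Mask-aware average pooling with non-overlapping k x k windows.

    x, mask: H x W nested lists; mask is 1 for valid entries, 0 for invalid.
    Returns (pooled, pooled_mask). Hypothetical sketch, not the paper's code.
    """
    h, w = len(x), len(x[0])
    pooled, pooled_mask = [], []
    for i in range(0, h - k + 1, k):
        row, mask_row = [], []
        for j in range(0, w - k + 1, k):
            total = 0.0
            n_valid = 0
            for di in range(k):
                for dj in range(k):
                    if mask[i + di][j + dj]:
                        total += x[i + di][j + dj]
                        n_valid += 1
            if n_valid > 0:
                # Average over valid entries only.
                row.append(total / n_valid)
                mask_row.append(1)
            else:
                # Fully invalid window: keep the output invalid.
                row.append(0.0)
                mask_row.append(0)
        pooled.append(row)
        pooled_mask.append(mask_row)
    return pooled, pooled_mask


features = [[1.0, 2.0, 5.0, 6.0],
            [3.0, 4.0, 7.0, 8.0]]
valid = [[1, 0, 0, 0],
         [1, 0, 0, 0]]

pooled, pooled_mask = partial_avg_pool2d(features, valid, k=2)
```

In this toy case the first window averages only its two valid entries (1.0 and 3.0), while the second window contains no valid entry and therefore stays invalid in the pooled mask.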