DensePercept-NCSSD: Vision Mamba towards Real-time Dense Visual Perception with Non-Causal State Space Duality

Tushar Anand, Advik Sinha, Abhijit Das

TL;DR

This work tackles the real-time, high-accuracy estimation of dense optical flow and stereo disparity by introducing DensePercept-NCSSD, a two-branch architecture built on a non-causal Mamba block and a non-causal state-space duality (SSD). By replacing the quadratic attention of transformers with linear, non-causal SSM-based computation and a pyramid-based matching scheme, the model achieves a favorable speed-accuracy-memory balance. The authors provide extensive experiments on optical flow and disparity across KITTI, VKITTI, Sintel, and Sceneflow, reporting state-of-the-art or competitive EPE, D1, FPS, and SOMER metrics while maintaining real-time capabilities. The proposed approach promises practical impact for real-time robotic perception and autonomous systems by delivering unified dense perception with reduced computational overhead. Overall, DensePercept-NCSSD demonstrates that non-causal SSD-based Mamba blocks can bridge speed, accuracy, and memory requirements in joint flow and disparity tasks.
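For readers unfamiliar with the metrics named above, the following is a minimal sketch of how EPE and D1 are commonly computed. It is not taken from the authors' code. The function names, array shapes, and the convention that a ground-truth disparity of zero marks a missing value are assumptions made for illustration. The D1 threshold (error above 3 px and above 5% of the true disparity) follows the usual KITTI 2015 convention.

```python
import numpy as np


def end_point_error(flow_pred, flow_gt, valid=None):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    flow_pred, flow_gt: arrays of shape (H, W, 2) holding (u, v) per pixel.
    valid: optional boolean (H, W) mask of pixels that have ground truth.
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)  # per-pixel error, (H, W)
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)
    return float(err[valid].mean())


def d1_outlier_rate(disp_pred, disp_gt, valid=None):
    """Percentage of valid pixels whose disparity error exceeds both
    3 px and 5% of the ground-truth disparity (KITTI 2015 convention)."""
    err = np.abs(disp_pred - disp_gt)
    if valid is None:
        valid = disp_gt > 0  # assumption: zero marks missing ground truth
    outlier = (err > 3.0) & (err > 0.05 * np.abs(disp_gt))
    return float(100.0 * outlier[valid].mean())


# Toy flow field: one of four pixels is off by the vector (3, 4), i.e. 5 px.
flow_gt = np.zeros((2, 2, 2))
flow_pred = np.zeros((2, 2, 2))
flow_pred[0, 0] = [3.0, 4.0]
epe = end_point_error(flow_pred, flow_gt)

# Toy disparity map: the last pixel has no ground truth and is ignored.
disp_gt = np.array([[10.0, 100.0], [50.0, 0.0]])
disp_pred = np.array([[14.0, 104.0], [50.5, 7.0]])
d1 = d1_outlier_rate(disp_pred, disp_gt)

print(f"EPE = {epe:.2f} px, D1 = {d1:.2f} %")
```

In the toy disparity map, the first two pixels both have a 4 px error, but only the first counts as an outlier: 4 px is more than 5% of 10 but less than 5% of 100. Lower is better for both metrics.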

Abstract

In this work, we propose an accurate, real-time optical flow and disparity estimation model that fuses pairwise input images in the proposed non-causal selective state space for dense perception tasks. The model is built on a non-causal Mamba block that is fast and efficient and handles the constraints of real-time applications. It reduces inference time while maintaining high accuracy and low GPU usage for optical flow and disparity map generation. The results, analysis, and validation in real-life scenarios show that the model can be used for unified, real-time, and accurate 3D dense perception estimation tasks. The code, along with the models, can be found at https://github.com/vimstereo/DensePerceptNCSSD

Paper Structure

This paper contains 15 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of the proposed DensePercept-NCSSD. The images in the stereo pair are represented in red and purple. (A), (B), and (C) are state matrices. The minus sign (-) denotes a split along the batch dimension. The plus sign (+) denotes concatenation along the batch dimension.
  • Figure 2: Overview of the context encoder used as reference network.
  • Figure 3: Overview of the macro architecture, which consists of the feature extraction block and the matching block.
  • Figure 4: Visual representation of optical flow on the KITTI15 dataset.
  • Figure 5: Visual representation of disparity on the KITTI15 dataset.