A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation

Xin Zhang, Liangxiu Han, Tam Sobeih, Lianghao Han, Darren Dancey

TL;DR

This work tackles depth estimation from asynchronous event camera spikes by introducing a Spike-Driven Transformer (SDT) that operates in a purely spike-based manner for feature extraction, paired with a multi-stage fusion depth head to preserve spatial detail. To address limited labeled event data, it employs a single-stage cross-modality distillation from the large vision foundation model DINOv2, guiding the spike network with both feature perceptual and scale-invariant depth losses. The approach yields substantial energy efficiency (e.g., ~12.43 mJ per inference and ~82.9% power reduction over a CNN baseline) and competitive depth accuracy on synthetic and real datasets, while reducing parameter count by about 42%. These results demonstrate a meaningful advance in neuromorphic depth estimation, enabling practical deployment for autonomous navigation and robotics with resource constraints. The framework also opens avenues for fully spike-based depth fusion and deployment on neuromorphic hardware.

Abstract

Depth estimation is a critical task in computer vision, with applications in autonomous navigation, robotics, and augmented reality. Event cameras, which encode temporal changes in light intensity as asynchronous binary spikes, offer unique advantages such as low latency, high dynamic range, and energy efficiency. However, their unconventional spiking output and the scarcity of labelled datasets pose significant challenges to traditional image-based depth estimation methods. To address these challenges, we propose a novel energy-efficient Spike-Driven Transformer Network (SDT) for depth estimation, leveraging the unique properties of spiking data. The proposed SDT introduces three key innovations: (1) a purely spike-driven transformer architecture that incorporates spike-based attention and residual mechanisms, enabling precise depth estimation with minimal energy consumption; (2) a fusion depth estimation head that combines multi-stage features for fine-grained depth prediction while ensuring computational efficiency; and (3) a cross-modality knowledge distillation framework that utilises a pre-trained vision foundation model (DINOv2) to enhance the training of the spiking network despite limited data availability. This work represents the first exploration of transformer-based spiking neural networks for depth estimation, providing a significant step forward in energy-efficient neuromorphic computing for real-world vision applications.
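
The TL;DR mentions a scale-invariant depth loss as one of the two training signals. The exact formulation used in the paper is not given in this summary, so the following is only a minimal sketch of the commonly used log-space form (in the style of Eigen et al.), written in plain Python. The function name `scale_invariant_log_loss` and the parameter `lam` are illustrative assumptions, not names taken from the paper.

```python
import math


def scale_invariant_log_loss(pred, target, lam=1.0):
    """Scale-invariant log-depth loss (common formulation, assumed here).

    pred, target: equal-length sequences of strictly positive depths.
    lam: weight of the scale-cancelling term; 1.0 makes the loss
         fully invariant to a global rescaling of the prediction.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal length")
    if any(p <= 0 for p in pred) or any(t <= 0 for t in target):
        raise ValueError("depth values must be strictly positive")

    # Per-pixel error measured in log space.
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(diffs)

    mean_sq = sum(d * d for d in diffs) / n  # mean squared log error
    mean = sum(diffs) / n                    # mean log error (global scale offset)

    # Subtracting lam * mean^2 removes the penalty for a global scale offset.
    return mean_sq - lam * mean * mean


# Hypothetical depth values, for illustration only.
target = [1.0, 2.0, 4.0, 8.0]
pred = [1.1, 1.9, 4.2, 7.5]
loss = scale_invariant_log_loss(pred, target)
loss_rescaled = scale_invariant_log_loss([3.0 * p for p in pred], target)
```

With `lam=1.0`, multiplying every predicted depth by the same constant leaves the loss unchanged (up to floating-point rounding), which is the property that makes this kind of loss suitable when absolute scale is ambiguous.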

Paper Structure

This paper contains 22 sections, 10 equations, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: The flowchart of the proposed method
  • Figure 2: The structure of spiking patch embedding
  • Figure 3: The structure of transformer block
  • Figure 4: The structure of the fusion head for depth estimation.
  • Figure 5: (a) The RGB image; (b) visualization of self-attention on DINOv2 features; (c) depth estimation results using a linear probe on frozen DINOv2 features.
  • ...and 6 more figures