Temporal Micro-Doppler Spectrogram-based ViT Multiclass Target Classification
Nghia Thinh Nguyen, Tri Nhu Do
TL;DR
The paper addresses multiclass target classification in cluttered micro-Doppler spectrogram (MDS) data from mmWave FMCW radar by modeling temporal dynamics with a Vision Transformer. It introduces the Temporal MDS-ViT (T-MDS-ViT), which ingests stacked range-velocity-angle (RVA) slices via patch embeddings and cross-axis attention, augmented by mobility-aware constraints and Grad-CAM explainability. The approach is reported to achieve higher accuracy and better data efficiency than CNN baselines while remaining deployable in real time, and it provides interpretable attention maps that highlight high-energy motion regions in the MDS. The framework aims at robust target discrimination under overlaps and occlusions, enabling practical radar-based sensing with cost-effective hardware.
Abstract
In this paper, we propose a new Temporal MDS-Vision Transformer (T-MDS-ViT) for multiclass target classification using millimeter-wave FMCW radar micro-Doppler spectrograms (MDS). Specifically, we design a transformer-based architecture that processes stacked range-velocity-angle (RVA) spatiotemporal tensors via patch embeddings and cross-axis attention mechanisms to explicitly model the sequential nature of MDS data across multiple frames. The T-MDS-ViT applies mobility-aware constraints to its attention-layer correspondences to maintain class separability under target overlaps and partial occlusions. Next, we apply an explainability mechanism to examine how the attention layers focus on characteristic high-energy regions of the MDS representations and how this focus relates to class-specific kinematic features. We also demonstrate that the proposed framework outperforms existing CNN-based methods in classification accuracy while achieving better data efficiency and real-time deployability.
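To make the tokenization step concrete, the following is a minimal NumPy sketch of patch embedding over a stack of frames, followed by attention applied first across patches within a frame and then across frames at each patch location. It is an illustration under stated assumptions, not the T-MDS-ViT implementation: the tensor shape `(T, H, W)`, the patch size, the embedding width, the identity query/key/value projections, and the function names `patch_embed`, `self_attention`, and `axis_factorized_attention` are all hypothetical, and the factorized spatial-then-temporal attention is only one plausible reading of "cross-axis attention". Mobility-aware constraints and Grad-CAM are not modeled here.

```python
import numpy as np


def patch_embed(x, patch, weight):
    """Split each frame of x (T, H, W) into non-overlapping patch x patch tiles
    and linearly project each flattened tile. Returns tokens of shape (T, N, D)."""
    T, H, W = x.shape
    assert H % patch == 0 and W % patch == 0, "frame size must be divisible by patch"
    gh, gw = H // patch, W // patch
    # (T, gh, patch, gw, patch) -> (T, gh, gw, patch, patch)
    tiles = x.reshape(T, gh, patch, gw, patch).transpose(0, 1, 3, 2, 4)
    tiles = tiles.reshape(T, gh * gw, patch * patch)
    return tiles @ weight


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Single-head scaled dot-product self-attention over the second-to-last axis.
    Assumption: identity Q/K/V projections, to keep the sketch short."""
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def axis_factorized_attention(tokens):
    """tokens: (T, N, D). Attend across the N patches of each frame, then
    across the T frames at each patch location. Returns (T, N, D)."""
    spatial = self_attention(tokens)                        # attention over N
    temporal = self_attention(np.swapaxes(spatial, 0, 1))   # (N, T, D), attention over T
    return np.swapaxes(temporal, 0, 1)


# Synthetic stand-in for a stack of 8 spectrogram-like frames of size 32 x 32.
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 32, 32))
W_embed = rng.standard_normal((4 * 4, 16)) * 0.1  # hypothetical projection, D = 16

tokens = patch_embed(frames, 4, W_embed)
out = axis_factorized_attention(tokens)
print(tokens.shape, out.shape)
```

With these assumed sizes, each 32 x 32 frame yields 64 tokens of width 16, so both `tokens` and `out` have shape `(8, 64, 16)`. Factorizing attention along one axis at a time keeps the score matrices at N x N and T x T rather than (T·N) x (T·N), which is one common way to keep spatiotemporal transformers tractable.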
