LiCAF: LiDAR-Camera Asymmetric Fusion for Gait Recognition

Yunze Deng, Haijun Xiong, Bin Feng

TL;DR

LiCAF addresses gait recognition by leveraging LiDAR-camera fusion with a modality-sensitive, asymmetric design. It introduces ACCA for cross-modal channel attention and ICTM for interlaced cross-modal temporal modeling, enabling fine-grained, temporally aware fusion of depth images and silhouettes. On SUSTech1K, LiCAF achieves state-of-the-art accuracy (93.9% Rank-1 and 98.8% Rank-5), with ablations confirming the effectiveness of both ACCA and ICTM and the benefit of asymmetric information flow. This work demonstrates the practical value of modality-aware fusion and temporal modeling for robust gait representations in challenging, real-world conditions.

Abstract

Gait recognition is a biometric technology that identifies individuals by using walking patterns. Due to the significant achievements of multimodal fusion in gait recognition, we consider employing LiDAR-camera fusion to obtain robust gait representations. However, existing methods often overlook intrinsic characteristics of modalities, and lack fine-grained fusion and temporal modeling. In this paper, we introduce a novel modality-sensitive network LiCAF for LiDAR-camera fusion, which employs an asymmetric modeling strategy. Specifically, we propose Asymmetric Cross-modal Channel Attention (ACCA) and Interlaced Cross-modal Temporal Modeling (ICTM) for cross-modal valuable channel information selection and powerful temporal modeling. Our method achieves state-of-the-art performance (93.9% in Rank-1 and 98.8% in Rank-5) on the SUSTech1K dataset, demonstrating its effectiveness.
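
The Rank-1 and Rank-5 figures quoted above are retrieval-style metrics. As a minimal sketch, assuming the common definition (a probe counts as a hit at Rank-k if its true identity appears among the k nearest gallery samples), they could be computed as follows. The function name, the toy distances, and the identity labels are hypothetical; the exact SUSTech1K evaluation protocol is not described in this excerpt.

```python
def rank_k_accuracy(dist, probe_ids, gallery_ids, k):
    """Fraction of probes whose true identity is among the k nearest gallery samples.

    dist[i][j] is the distance between probe i and gallery sample j
    (smaller means more similar). This layout is an assumption for illustration.
    """
    hits = 0
    for row, probe_id in zip(dist, probe_ids):
        # Gallery indices sorted from nearest to farthest for this probe.
        order = sorted(range(len(row)), key=lambda j: row[j])
        top_ids = [gallery_ids[j] for j in order[:k]]
        if probe_id in top_ids:
            hits += 1
    return hits / len(probe_ids)


# Toy data (hypothetical): 3 probes, 4 gallery samples.
gallery_ids = ["a", "b", "c", "d"]
probe_ids = ["a", "b", "d"]
dist = [
    [0.1, 0.9, 0.8, 0.7],  # true match is the nearest sample
    [0.6, 0.5, 0.2, 0.9],  # true match is the second nearest
    [0.3, 0.4, 0.9, 0.8],  # true match is the third nearest
]

rank1 = rank_k_accuracy(dist, probe_ids, gallery_ids, k=1)
rank3 = rank_k_accuracy(dist, probe_ids, gallery_ids, k=3)
```

Rank-k accuracy is non-decreasing in k, which is why a Rank-5 score is always at least as high as the corresponding Rank-1 score.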

Paper Structure

This paper contains 13 sections, 7 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: A simplified diagram of four commonly used strategies to model multi-modal information. In this context, Modeling $A\gets B$ indicates the process of modeling Modality A, incorporating supplementary information from Modality B. The notation $\mathcal{F}(A\gets B)$ represents the modeling results of Modality A, including information from Modality B.
  • Figure 2: Overview of LiCAF. Depth image features $F_L$ and silhouette features $F_C$ are obtained from the LiDAR Feature Extractor and Camera Feature Extractor. ACCA selects channel information from $F_L$ and $F_C$, resulting in channel-enhanced features $E_L$ and $E_C$. Subsequently, $E_L$ and $E_C$ are passed through an HPP operation to form the inputs $S_L$ and $S_C$ for ICTM. Next, ICTM performs temporal modeling with $L$ layers, yielding LiDAR features $S_L^{cls}$ and camera features $S_C^{cls}$. Finally, the fusion of $S_L^{cls}$ and $S_C^{cls}$ results in the gait representation $S_{Fusion}$. Here, $L_{tri}$ and $L_{ce}$ represent triplet loss and cross-entropy loss respectively, and $\alpha$ and $\beta$ after ACCA are learnable weights.
  • Figure 3: The detailed structure of ACCA. Here, $\mathrm{TP}$ denotes temporal max pooling, $\mathrm{GAP}$ represents spatial global average pooling, $\mathrm{FC}$ refers to a linear projection layer, and $\mathrm{ReLU}$ indicates the activation function.
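
The building blocks named in the Figure 3 caption (TP, GAP, FC, ReLU) can be combined into a channel-attention gate. The NumPy sketch below is only an illustration under stated assumptions, not the authors' ACCA: the exact wiring, the sigmoid gate, the two-layer bottleneck, the tensor layout `(C, T, H, W)`, and all function and parameter names are hypothetical choices that do not come from the captions. It shows one direction only (gating LiDAR channels with weights derived from both modalities) and omits the learnable weights $\alpha$ and $\beta$.

```python
import numpy as np


def channel_descriptor(feat):
    """Reduce a (C, T, H, W) feature map to a (C,) channel descriptor."""
    pooled_t = feat.max(axis=1)        # TP: temporal max pooling -> (C, H, W)
    return pooled_t.mean(axis=(1, 2))  # GAP: spatial global average pooling -> (C,)


def fc(x, weight, bias):
    """Linear projection layer (FC)."""
    return weight @ x + bias


def cross_modal_channel_weights(f_l, f_c, params):
    """Per-channel gates computed from both modalities (assumed wiring)."""
    # Concatenate the two channel descriptors into one (2C,) vector.
    d = np.concatenate([channel_descriptor(f_l), channel_descriptor(f_c)])
    hidden = np.maximum(fc(d, params["w1"], params["b1"]), 0.0)  # FC + ReLU
    logits = fc(hidden, params["w2"], params["b2"])              # FC -> (C,)
    # Sigmoid gate is an assumption; the caption does not name the final nonlinearity.
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
C, T, H, W, R = 8, 5, 4, 3, 4  # R is a hypothetical bottleneck width

f_l = rng.standard_normal((C, T, H, W))  # stand-in for depth image features F_L
f_c = rng.standard_normal((C, T, H, W))  # stand-in for silhouette features F_C

params = {
    "w1": rng.standard_normal((R, 2 * C)) * 0.1, "b1": np.zeros(R),
    "w2": rng.standard_normal((C, R)) * 0.1, "b2": np.zeros(C),
}

weights = cross_modal_channel_weights(f_l, f_c, params)
e_l = f_l * weights[:, None, None, None]  # broadcast the gates over T, H, W
```

In an asymmetric design, the reverse direction (gating $F_C$) would presumably use its own parameters or a different structure rather than mirroring this one, but the captions here do not specify how.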