CM2-Net: Continual Cross-Modal Mapping Network for Driver Action Recognition

Ruoyu Wang, Chen Cai, Wenqian Wang, Jianjun Gao, Dan Lin, Wenyang Liu, Kim-Hui Yap

TL;DR

CM2-Net addresses data scarcity and domain gaps for non-RGB modalities in driver action recognition by continual cross-modal learning. It initializes with an RGB encoder and uses Accumulative Cross-modal Mapping Prompting to transfer discriminative RGB features into new modality spaces, guided by a frozen language encoder for semantic supervision. Training employs a composite contrastive objective that aligns modalities and leverages mapped features as prompts, with prompts accumulating across modalities. On Drive&Act, CM2-Net achieves state-of-the-art results for uni- and multi-modal recognition, especially for IR and Depth, demonstrating effective cross-modal knowledge transfer and practical gains for robust in-car perception.

Abstract

Driver action recognition has advanced significantly by integrating multiple modalities, such as infrared and depth, improving driver-vehicle interaction and driving safety. Nevertheless, compared with RGB, collecting extensive data for every non-RGB modality in car cabin environments is laborious and costly. Previous works have therefore suggested learning each non-RGB modality independently by fine-tuning a model pre-trained on RGB videos, but these methods are less effective at extracting informative features from newly incoming modalities because of large domain gaps. In contrast, we propose a Continual Cross-Modal Mapping Network (CM2-Net) that continually learns each newly incoming modality with instructive prompts from previously learned modalities. Specifically, we develop Accumulative Cross-modal Mapping Prompting (ACMP) to map the discriminative and informative features learned from previous modalities into the feature space of newly incoming modalities. When a new modality arrives, these mapped features serve as prompts indicating which features should be extracted and prioritized. The prompts accumulate throughout the continual learning process, further boosting recognition performance. Extensive experiments on the Drive&Act dataset demonstrate the superior performance of CM2-Net on both uni- and multi-modal driver action recognition.
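
The accumulation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear mapping matrices, the feature dimensions, and the names `matvec`, `accumulate_prompts`, `prev`, and `maps` are all assumptions made for illustration, since this section does not specify the mapper architecture. The sketch only shows that each previously learned modality contributes one mapped feature, so the prompt set grows as more modalities are learned.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply a matrix W (rows x cols) by a vector x (length cols)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def accumulate_prompts(prev_features: Dict[str, Vector],
                       mappers: Dict[str, Matrix]) -> List[Vector]:
    """Map each previously learned modality's feature into the new space.

    Assumption: one linear mapper per previous modality. The returned list
    holds one prompt per learned modality, so it grows over time.
    """
    return [matvec(mappers[name], feat) for name, feat in prev_features.items()]


# Hypothetical 3-d features from two already-learned modalities.
prev = {"rgb": [1.0, 0.0, 2.0], "ir": [0.0, 1.0, 1.0]}
# Hypothetical mapping matrices into a 2-d feature space for a new modality.
maps = {
    "rgb": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    "ir": [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
prompts = accumulate_prompts(prev, maps)
print(len(prompts), prompts)
```

In a real system the mappers would be learned modules and the prompts would be fed to the new modality's encoder during training; here they are fixed matrices so the data flow is easy to follow.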

Paper Structure (16 sections, 14 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: The features extracted by pre-trained encoders (such as RGB) are discriminative and informative (Girdhar et al., 2023, ImageBind). Instead of training a new encoder from scratch without any prior knowledge, we propose to map these well-extracted features into the feature space of the new modality to prompt its training. The prompt helps align the features extracted by the new encoder with textual embeddings representing driver actions in the semantic space, thereby enhancing the robustness and accuracy of driver action recognition.
  • Figure 2: Overview of CM$^2$-Net, a network designed for continual learning across different modalities in driver action recognition. CM$^2$-Net first fine-tunes an RGB encoder to learn discriminative RGB features and classifies driver actions based on similarity scores with the textual embeddings of the labels. Then, for a new modality (such as Depth), CM$^2$-Net employs Accumulative Cross-modal Mapping Prompting (ACMP) to train a modality-specific encoder (such as a Depth encoder) with prompts from previously learned modalities. ACMP maps the accumulated discriminative features from multiple established modalities (such as RGB and IR) into the feature space of the new modality, indicating which crucial features should be extracted during encoder training. In this way, the prompts improve the alignment between the new modality embeddings and the textual embeddings for accurate classification.
  • Figure 3: A t-SNE visualization of features extracted from different action categories. Different colors represent different actions. The features extracted by the baseline network UniFormerV2 (a) are more scattered than those extracted by CM$^2$-Net (b), which demonstrates the efficacy of our method in mining discriminative information.
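
The Figure 2 caption describes classifying driver actions by similarity scores between a video embedding and the textual embeddings of the labels. A minimal sketch of that step follows, assuming cosine similarity as the score and using made-up 2-d embeddings. The function names `cosine` and `classify`, the toy labels, and all vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(video_emb: Vector,
             label_embs: Dict[str, Vector]) -> Tuple[str, Dict[str, float]]:
    """Return the label whose text embedding is most similar to the video."""
    scores = {label: cosine(video_emb, emb) for label, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical text embeddings for two toy action labels.
label_embs = {"drinking": [1.0, 0.0], "eating": [0.0, 1.0]}
# Hypothetical video embedding from a modality-specific encoder.
pred, scores = classify([0.9, 0.1], label_embs)
print(pred)
```

Because every modality-specific encoder is scored against the same label embeddings, a new encoder only needs to place its features near the right text embedding, which is the alignment the prompts are meant to support.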