MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception

Wenzhuo Liu, Wenshuo Wang, Yicheng Qiao, Qiannan Guo, Jiayin Zhu, Pengfei Li, Zilong Chen, Huiming Yang, Zhiwei Li, Lening Wang, Tiao Tan, Huaping Liu

TL;DR

The paper addresses how advanced driver assistance systems (ADAS) can jointly understand the driver's internal state and the surrounding traffic context, and proposes MMTL-UniAD, a unified multimodal multi-task framework. It introduces a Multi-axis Region Attention Network (MARNet) to extract task-relevant features from multi-view images, and a Dual-Branch Multimodal Embedding to balance shared and task-specific learning across four tasks: driver emotion, driver behavior, traffic context, and vehicle behavior. Experiments and ablation studies on the AIDE dataset show that MMTL-UniAD outperforms state-of-the-art methods on all four tasks, and that both MARNet and the dual-branch embedding contribute to mitigating negative transfer and enabling cross-task knowledge sharing. The framework offers a baseline for integrated multimodal multi-task learning in ADAS, with the potential to improve driving safety through richer context understanding.

Abstract

Advanced driver assistance systems require a comprehensive understanding of the driver's mental and physical state and of the traffic context, but existing works often neglect the potential benefits of joint learning between these tasks. This paper proposes MMTL-UniAD, a unified multimodal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, smooth traffic). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To address this, we introduce two key components into the framework: a multi-axis region attention network that extracts global context-sensitive features, and a dual-branch multimodal embedding that learns multimodal embeddings from both task-shared and task-specific features. The former uses a multi-attention mechanism to extract task-relevant features, mitigating negative transfer caused by task-unrelated features. The latter employs a dual-branch structure to adaptively adjust task-shared and task-specific parameters, enhancing cross-task knowledge transfer while reducing task conflicts. We evaluate MMTL-UniAD on the AIDE dataset, with a series of ablation studies, and show that it outperforms state-of-the-art methods across all four tasks. The code is available at https://github.com/Wenzhuo-Liu/MMTL-UniAD.

Paper Structure

This paper contains 22 sections, 12 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Interaction between traffic context and driver states. Tasks (a), (b), (c), and (d) represent traffic context recognition, vehicle behavior recognition, driver behavior recognition, and driver emotion recognition, respectively. Together, these tasks illustrate the complex and closely interconnected relationship between the driver and traffic.
  • Figure 2: The overall pipeline of MMTL-UniAD. MMTL-UniAD consists of two primary components: the Multimodal Encoder and the Dual-Branch Multimodal Embedding. The multimodal encoder is composed of a Multi-axis Region Attention Network (MARNet) and a 3D-CNN, which extract features from multi-view images and driver joints, respectively. The Dual-Branch Multimodal Embedding further integrates the multimodal features for multi-task recognition.
  • Figure 3: Comparison of self-attention variants. (a) shows the most common global self-attention in images; (b), (c), and (d) show vertical attention, horizontal attention, and horizontal-vertical attention, respectively. Variant (d) is the horizontal-vertical attention introduced in this paper.
  • Figure 4: The flowchart of the MARNet architecture, including the processes for horizontal-vertical attention and region attention.
  • Figure 5: Structure of the Dual-Branch Multimodal Embedding.
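
The attention variants named in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' MARNet implementation: it assumes identity query/key/value projections (no learned weights), a single head, and a sequential horizontal-then-vertical composition, none of which are stated in this summary. The function names `axis_attention` and `horizontal_vertical_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def axis_attention(feat, axis):
    """Self-attention restricted to one spatial axis of an (H, W, C) map.

    axis=1: horizontal attention (each token attends within its own row).
    axis=0: vertical attention (each token attends within its own column).
    Assumption: Q = K = V = feat, i.e. no learned projections.
    """
    # Arrange the data as (groups, sequence_length, channels).
    x = feat if axis == 1 else feat.transpose(1, 0, 2)
    channels = x.shape[-1]
    # Scaled dot-product scores within each row (or column).
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(channels)
    out = softmax(scores, axis=-1) @ x
    # Restore the original (H, W, C) layout.
    return out if axis == 1 else out.transpose(1, 0, 2)


def horizontal_vertical_attention(feat):
    """Horizontal attention followed by vertical attention.

    Assumption: the two axes are composed sequentially; the paper may
    combine them differently.
    """
    return axis_attention(axis_attention(feat, axis=1), axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature_map = rng.normal(size=(4, 5, 8))  # H=4, W=5, C=8
    mixed = horizontal_vertical_attention(feature_map)
    print(mixed.shape)
```

Compared with global self-attention over all H×W tokens, each single-axis pass only compares tokens that share a row or a column, so the score matrices have size W×W or H×H rather than (H·W)×(H·W). After both passes every position has received information from the whole map.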