MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception

Wenzhuo Liu; Wenshuo Wang; Yicheng Qiao; Qiannan Guo; Jiayin Zhu; Pengfei Li; Zilong Chen; Huiming Yang; Zhiwei Li; Lening Wang; Tiao Tan; Huaping Liu

TL;DR

The paper tackles the challenge of enabling advanced driver assistance systems (ADAS) to jointly understand the driver's internal state and the surrounding traffic context by proposing MMTL-UniAD, a unified multimodal multi-task framework. It introduces a Multi-axis Region Attention Network (MARNet) to extract task-relevant features from multi-view images and a Dual-Branch Multimodal Embedding to balance shared and task-specific learning across four tasks (driver emotion, driver behavior, traffic context, and vehicle behavior). Experiments on the AIDE dataset, including ablations, show that it outperforms state-of-the-art methods and that both MARNet and the dual-branch embedding contribute to mitigating negative transfer and enabling cross-task knowledge sharing. The findings offer a baseline for integrated multimodal multi-task learning in ADAS, with the potential to improve real-world driving safety through richer context understanding.

Abstract

Advanced driver assistance systems require a comprehensive understanding of the driver's mental and physical state and of the traffic context, but existing works often neglect the potential benefits of joint learning between these tasks. This paper proposes MMTL-UniAD, a unified multimodal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, smooth traffic). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To address this, we introduce two key components into the framework: a multi-axis region attention network that extracts global context-sensitive features, and a dual-branch multimodal embedding that learns multimodal embeddings from both task-shared and task-specific features. The former uses a multi-attention mechanism to extract task-relevant features, mitigating negative transfer caused by task-unrelated features. The latter employs a dual-branch structure to adaptively adjust task-shared and task-specific parameters, enhancing cross-task knowledge transfer while reducing task conflicts. We evaluate MMTL-UniAD on the AIDE dataset, including a series of ablation studies, and show that it outperforms state-of-the-art methods across all four tasks. The code is available at https://github.com/Wenzhuo-Liu/MMTL-UniAD.
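
To make the dual-branch idea concrete, the following is a minimal sketch in plain Python and is not the authors' implementation. It assumes one shared projection reused by all four tasks, one task-specific projection per task, and a per-task sigmoid gate that mixes the two; the gate, the class name `DualBranchSketch`, the feature sizes, and the per-task class counts are all hypothetical choices made for illustration. The multi-axis region attention network is not modeled here: the input vector simply stands in for already-extracted multimodal features.

```python
import math
import random

# Hypothetical class counts per task (placeholders, not taken from the paper).
TASK_CLASSES = {
    "driver_emotion": 5,
    "driver_behavior": 7,
    "traffic_context": 3,
    "vehicle_behavior": 5,
}


def linear(weights, x):
    """Dense layer without bias: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class DualBranchSketch:
    """Illustrative shared + task-specific branches with a per-task gate."""

    def __init__(self, task_classes, in_dim=8, hid_dim=4, seed=0):
        rng = random.Random(seed)
        # Task-shared branch: a single projection reused by every task.
        self.shared = rand_matrix(rng, hid_dim, in_dim)
        # Task-specific branch: one projection per task.
        self.specific = {t: rand_matrix(rng, hid_dim, in_dim) for t in task_classes}
        # Assumed gate logit per task; sigmoid(0) = 0.5 means an equal mix.
        self.gate = {t: 0.0 for t in task_classes}
        # One classification head per task.
        self.heads = {t: rand_matrix(rng, n, hid_dim) for t, n in task_classes.items()}

    def forward(self, x):
        shared_feat = linear(self.shared, x)
        outputs = {}
        for task, head in self.heads.items():
            own_feat = linear(self.specific[task], x)
            g = 1.0 / (1.0 + math.exp(-self.gate[task]))
            # Convex mix of shared and task-specific features for this task.
            mixed = [g * s + (1.0 - g) * o for s, o in zip(shared_feat, own_feat)]
            outputs[task] = softmax(linear(head, mixed))
        return outputs


model = DualBranchSketch(TASK_CLASSES)
probs = model.forward([0.5] * 8)  # stand-in for fused multimodal features
```

In this sketch, raising a task's gate logit shifts that task toward the shared features, and lowering it shifts the task toward its own branch, which is one simple way to express the trade-off between cross-task knowledge transfer and task conflict that the abstract describes.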
