Towards In-Vehicle Multi-Task Facial Attribute Recognition: Investigating Synthetic Data and Vision Foundation Models
Esmaeil Seraj, Walter Talamonti
TL;DR
This study investigates in-vehicle multi-task facial attribute recognition using synthetic data and pre-trained vision foundation models (ViT and ResNet). It systematically compares preprocessing strategies, adaptation methods, and training techniques, revealing a counter-intuitive finding: ResNet often outperforms ViT when task complexity and model capacity are mismatched, especially under limited data. The work highlights the nuanced impact of synthetic data distributions on in-distribution versus out-of-distribution performance and underscores the need for diverse, realistic data and robust adaptation to achieve practical, real-world generalization in vehicle-driver perception. Overall, the results demonstrate the potential of synthetic data and foundation models for rapid, multi-task in-vehicle perception while emphasizing challenges in generalization to real-world conditions.
Abstract
In the burgeoning field of intelligent transportation systems, enhancing vehicle-driver interaction through facial attribute recognition (for example, facial expression, eye gaze, and age) is of paramount importance for safety, personalization, and overall user experience. However, the scarcity of comprehensive, large-scale, real-world datasets poses a significant challenge for training robust multi-task models. Existing literature often overlooks the potential of synthetic datasets and the comparative efficacy of state-of-the-art vision foundation models in such constrained settings. This paper addresses these gaps by investigating the utility of synthetic datasets for training complex multi-task models that recognize facial attributes of vehicle passengers, such as gaze plane, age, and facial expression. Using transfer learning with pre-trained Vision Transformer (ViT) and Residual Network (ResNet) models, we explore various training and adaptation methods to optimize performance, particularly when data availability is limited. We provide extensive post-evaluation analysis of how synthetic data distributions affect model performance on in-distribution data and during out-of-distribution inference. Our study unveils counter-intuitive findings, notably the superior performance of ResNet over ViT in our specific multi-task context, which we attribute to a mismatch between model complexity and task complexity. Our results highlight the challenges and opportunities for enhancing the use of synthetic data and vision foundation models in practical applications.
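
As a rough illustration of the multi-task setup the abstract describes, the sketch below attaches one linear head per attribute to a shared feature vector and sums the per-task cross-entropy losses. It is a minimal sketch under stated assumptions, not the authors' implementation: the class counts in `TASKS`, the helper names (`linear_head`, `multi_task_loss`), the uniform task weights, and the three-dimensional `features` vector standing in for a frozen ViT or ResNet backbone output are all hypothetical.

```python
import math

# Hypothetical class counts; the abstract names the tasks
# (gaze plane, age, facial expression) but not their label spaces.
TASKS = {"gaze_plane": 3, "age": 4, "expression": 5}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_head(features, weights, bias):
    """One task-specific linear layer: logits = W @ features + b."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def multi_task_loss(features, heads, labels, task_weights=None):
    """Weighted sum of per-task cross-entropy losses on shared features."""
    task_weights = task_weights or {name: 1.0 for name in heads}
    total = 0.0
    for name, (weights, bias) in heads.items():
        probs = softmax(linear_head(features, weights, bias))
        total += task_weights[name] * -math.log(probs[labels[name]])
    return total


# Assumed stand-in for the output of a frozen pre-trained backbone.
features = [0.5, -1.0, 2.0]
# Zero-initialized heads, so every task starts from a uniform prediction.
heads = {name: ([[0.0] * len(features) for _ in range(n)], [0.0] * n)
         for name, n in TASKS.items()}
labels = {"gaze_plane": 0, "age": 2, "expression": 4}
loss = multi_task_loss(features, heads, labels)
```

With zero-initialized heads each task predicts a uniform distribution, so the loss reduces to the sum of the logarithms of the class counts. Sharing one feature vector across heads is the usual reason a single backbone can serve several attributes at once; how the paper weights or balances the tasks is not stated in the abstract.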
