Towards In-Vehicle Multi-Task Facial Attribute Recognition: Investigating Synthetic Data and Vision Foundation Models

Esmaeil Seraj, Walter Talamonti

TL;DR

This study investigates in-vehicle multi-task facial attribute recognition using synthetic data and pre-trained vision foundation models (ViT and ResNet). It systematically compares preprocessing strategies, adaptation methods, and training techniques, revealing a counter-intuitive finding: ResNet often outperforms ViT when task complexity and model capacity are mismatched, especially under limited data. The work highlights the nuanced impact of synthetic data distributions on in-distribution versus out-of-distribution performance and underscores the need for diverse, realistic data and robust adaptation to achieve practical, real-world generalization in vehicle-driver perception. Overall, the results demonstrate the potential of synthetic data and foundation models for rapid, multi-task in-vehicle perception while emphasizing challenges in generalization to real-world conditions.

Abstract

In the burgeoning field of intelligent transportation systems, enhancing vehicle-driver interaction through facial attribute recognition, such as facial expression, eye gaze, age, etc., is of paramount importance for safety, personalization, and overall user experience. However, the scarcity of comprehensive large-scale, real-world datasets poses a significant challenge for training robust multi-task models. Existing literature often overlooks the potential of synthetic datasets and the comparative efficacy of state-of-the-art vision foundation models in such constrained settings. This paper addresses these gaps by investigating the utility of synthetic datasets for training complex multi-task models that recognize facial attributes of passengers of a vehicle, such as gaze plane, age, and facial expression. Utilizing transfer learning techniques with both pre-trained Vision Transformer (ViT) and Residual Network (ResNet) models, we explore various training and adaptation methods to optimize performance, particularly when data availability is limited. We provide extensive post-evaluation analysis, investigating the effects of synthetic data distributions on model performance in in-distribution data and out-of-distribution inference. Our study unveils counter-intuitive findings, notably the superior performance of ResNet over ViTs in our specific multi-task context, which is attributed to the mismatch in model complexity relative to task complexity. Our results highlight the challenges and opportunities for enhancing the use of synthetic data and vision foundation models in practical applications.

Paper Structure

This paper contains 48 sections, 5 equations, 15 figures, and 2 tables.

Figures (15)

  • Figure 1: Example of a vision foundation model for in-vehicle perception and intelligence. Data is collected and used to train a deep neural network model. The trained model then generates a high-dimensional feature space representing the input data, which can be used for any downstream task. To adapt the learned foundation model to a downstream task, adaptation methods are needed.
  • Figure 2: Example of in-vehicle perception and intelligence via multi-task facial attribute recognition on synthetically generated data. The figure demonstrates an in-cabin perception system capable of understanding several facial attributes of the driver and passengers, such as gaze, age, and facial expression. Such a modular system, built on readily generated synthetic data and existing vision foundation models, can be adapted to any desired downstream task, enabling enhanced vehicle-driver interaction and passenger experience.
  • Figure 3: Our multi-task facial attribute recognition architecture, built via transfer learning from pre-trained vision foundation models. We employ two distinct vision foundation models as the backbone of the multi-task learning architecture: the Vision Transformer (ViT) and the Residual Network (ResNet). Both architectures share a common preprocessing block. The patchification block is applied only for the ViT model and is crucial for converting image data into a format that the transformer architecture can effectively process. The output of these pre-trained models, a high-dimensional feature representation, is then directed into three separate task heads. The final layer in each task head is a linear layer whose number of neurons equals the number of classes for the respective task, outputting a probability vector for each prediction.
  • Figure 4: Image samples from the SynthA dataset. The RetinaFace pre-trained model (serengil2020lightface; serengil2021lightface) has been applied to extract the face bounding box (i.e., green box) of the driver.
  • Figure 5: Image samples from the SynthB dataset. The RetinaFace pre-trained model (serengil2020lightface; serengil2021lightface) has been applied to extract the face bounding box (i.e., green box) of the driver.
  • ...and 10 more figures
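
The data flow described in the Figure 3 caption (one shared backbone feeding three task heads, each ending in a linear layer with one output per class) can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation. The feature width, the class counts, the `toy_backbone` stand-in, and the choice of a softmax after each linear layer are all assumptions made for illustration. A real system would use a pre-trained ViT or ResNet from a deep learning library in place of the stand-in.

```python
import math
import random

# Hypothetical sizes: the source states neither the feature width nor the class counts.
FEATURE_DIM = 8
NUM_CLASSES = {"gaze": 4, "age": 3, "expression": 5}


def softmax(logits):
    """Convert raw scores into a probability vector (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearHead:
    """One task head: a single linear layer with one output neuron per class."""

    def __init__(self, in_dim, num_classes, rng):
        # Small random weights stand in for trained parameters.
        self.weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                        for _ in range(num_classes)]
        self.bias = [0.0] * num_classes

    def __call__(self, features):
        logits = [sum(w * x for w, x in zip(row, features)) + b
                  for row, b in zip(self.weights, self.bias)]
        # Assumption: a softmax turns the linear outputs into probabilities.
        return softmax(logits)


class MultiTaskModel:
    """Shared backbone followed by one independent head per task."""

    def __init__(self, backbone, feature_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.backbone = backbone
        self.heads = {task: LinearHead(feature_dim, n, rng)
                      for task, n in num_classes.items()}

    def predict(self, image):
        features = self.backbone(image)  # computed once, shared by all heads
        return {task: head(features) for task, head in self.heads.items()}


def toy_backbone(image, feature_dim=FEATURE_DIM):
    """Hypothetical stand-in for a pre-trained backbone: mean-pools pixel chunks."""
    flat = [pixel for row in image for pixel in row]
    chunk = len(flat) // feature_dim
    return [sum(flat[i * chunk:(i + 1) * chunk]) / chunk
            for i in range(feature_dim)]


# A 4x4 grayscale "image" with values in [0, 1).
image = [[(r * 4 + c) / 16.0 for c in range(4)] for r in range(4)]
model = MultiTaskModel(toy_backbone, FEATURE_DIM, NUM_CLASSES)
predictions = model.predict(image)
for task, probs in predictions.items():
    print(task, [round(p, 3) for p in probs])
```

The features are computed once and reused by every head, so adding a task under this design costs only one extra linear layer.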