Detection of Intoxicated Individuals from Facial Video Sequences via a Recurrent Fusion Model
Bita Baroutian, Atefe Aghaei, Mohsen Ebrahimi Moghaddam
TL;DR
The paper tackles non-invasive intoxication detection from video by proposing a recurrent fusion model that combines facial landmark dynamics, modeled with a Graph Attention Network (GAT), with a 3D-ResNet spatiotemporal feature stream. It introduces a shot-based processing workflow and a learnable adaptive fusion step that integrates the two feature streams. A new YouTube-derived dataset of 3,542 clips from 202 individuals supports training and evaluation. The model reaches 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming the 3D-CNN and VGGFace+LSTM baselines. The approach shows practical potential for public-safety monitoring while raising considerations around demographics, interpretability, and future extensions to related affective or safety-critical tasks.
Abstract
Alcohol consumption is a significant public health concern and a major cause of accidents and fatalities worldwide. This study introduces a novel video-based approach that detects alcohol intoxication from facial sequences. The method integrates facial landmark analysis via a Graph Attention Network (GAT) with spatiotemporal visual features extracted using a 3D ResNet. These features are dynamically fused with adaptive prioritization to enhance classification performance. Additionally, we introduce a curated dataset comprising 3,542 video segments derived from 202 individuals to support training and evaluation. Our model is compared against two baselines: a custom 3D-CNN and a VGGFace+LSTM architecture. Experimental results show that our approach achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming both baselines. The findings demonstrate the model's potential for practical deployment in public safety systems for non-invasive, reliable alcohol intoxication detection.
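The abstract does not spell out how the two feature streams are "dynamically fused with adaptive prioritization". As a rough illustration only, the sketch below shows one common form such a fusion could take: a small gate scores each stream from the concatenated features, a softmax turns the scores into weights, and the fused vector is the weighted sum. The function names, the gate parameterization, and the toy numbers are all assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fusion(landmark_feat, visual_feat, gate_weights, gate_bias):
    """Fuse two equal-length feature vectors with input-dependent weights.

    Hypothetical gate (an assumption, not the paper's layer):
    gate_weights holds two weight vectors of length 2*d, one per stream,
    and gate_bias holds two scalars. In a trained model these would be
    learned parameters.
    """
    if len(landmark_feat) != len(visual_feat):
        raise ValueError("both streams must have the same feature dimension")
    concat = list(landmark_feat) + list(visual_feat)
    # One scalar score per stream, computed from both streams' features.
    scores = [
        sum(w * x for w, x in zip(weights, concat)) + bias
        for weights, bias in zip(gate_weights, gate_bias)
    ]
    alpha = softmax(scores)  # alpha[0]: landmark stream, alpha[1]: visual stream
    fused = [
        alpha[0] * l + alpha[1] * v
        for l, v in zip(landmark_feat, visual_feat)
    ]
    return fused, alpha


# Toy inputs standing in for GAT and 3D-ResNet embeddings (made-up values).
landmark = [0.2, -0.1, 0.4]
visual = [0.5, 0.3, -0.2]
zero_gate = [[0.0] * 6, [0.0] * 6]

fused_equal, alpha_equal = adaptive_fusion(landmark, visual, zero_gate, [0.0, 0.0])
fused_visual, alpha_visual = adaptive_fusion(landmark, visual, zero_gate, [0.0, 10.0])
```

With an all-zero gate the two streams are weighted equally, while a large bias on the second score shifts almost all weight to the visual stream. In a trained model the gate would produce different weights for each input clip, which is one way to read "adaptive prioritization".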
