Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification

Akshit Pramod Anchan, Jewelith Thomas, Sritama Roy

TL;DR

This work demonstrates a modular deep learning framework for assistive perception built from three independent modules: gaze (eye-state) detection, facial expression recognition (FER), and speaker identification. Each module uses a domain-specific model (a CNN with transfer learning for eye state, a CNN trained on FER2013 for FER, and an LSTM over MFCC features for speaker identification), and the modules reach reported accuracies of 93.0%, 97.8%, and 96.89%, respectively. The findings support lightweight, task-focused architectures as building blocks for real-time, multimodal assistive systems in resource-constrained environments. The study lays a foundation for future pipeline integration, latency optimization, and broader evaluation across diverse datasets and demographics.

Abstract

Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspired by core functionalities of perceptive systems like 'Smart Eye.' We propose and benchmark three independent sensing modules: a Convolutional Neural Network (CNN) for eye state detection (drowsiness/attention), a deep CNN for facial expression recognition, and a Long Short-Term Memory (LSTM) network for voice-based speaker identification. Utilizing the Eyes Image, FER2013, and customized audio datasets, our models achieved accuracies of 93.0%, 97.8%, and 96.89%, respectively. This study demonstrates that lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.

Paper Structure

This paper contains 12 sections, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Visualisation of the eye detection model.
  • Figure 2: Visualisation of the facial expression model.
  • Figure 3: Architecture of a general LSTM network.
  • Figure 4: Confusion matrix of the eye detection model.
  • Figure 5: (a) Accuracy graph and (b) Loss graph of the eye detection model.
  • ...and 4 more figures