
A New Perspective on Smiling and Laughter Detection: Intensity Levels Matter

Hugo Bohy, Kevin El Haddad, Thierry Dutoit

TL;DR

This work presents a deep learning-based multimodal smile and laugh classification system that treats smiles and laughs as two different entities. It compares audio-based and vision-based models with a fusion approach and shows that fusion generalizes better to unseen data.

Abstract

Smile and laugh detection systems have attracted a lot of attention in the past decade, contributing to the improvement of human-agent interaction systems. However, very few treat these expressions as distinct, although no prior work clearly establishes whether or not they belong to the same category. In this work, we present a deep learning-based multimodal smile and laugh classification system that considers them as two different entities. We compare audio-based and vision-based models as well as a fusion approach. We show that, as expected, fusion leads to better generalization on unseen data. We also present an in-depth analysis of the behavior of these models across smile and laugh intensity levels. This analysis shows that the relationship between smiles and laughs might not be as simple as a binary one, and that grouping them in a single category may not capture it either, so a more complex approach should be taken when dealing with them. We also tackle the problem of limited resources by showing that transfer learning allows the models to improve the detection of confusing intensity levels.

Paper Structure (11 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: Architectures of the audio model, the visual model and the fusion of both.
  • Figure 2: Class distribution heatmaps. Each column corresponds to the predicted class while each row shows the ground-truth label and its intensity. The colour gradient expresses the distribution in percentage per row (the sum of each row should be 100%).
  • Figure 3: t-SNE dimensionality reduction applied to each modality. Graphical results are displayed in Figs. 4 and 5.
  • Figure 4: Audio model outputs in a 2D t-SNE representation.
  • Figure 5: Visual model outputs in a 2D t-SNE representation.
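
The row-wise normalization described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' code: the function name `row_percentages`, the class names, the intensity labels, and the sample records are all hypothetical and chosen only for illustration. Each row is keyed by a ground-truth label and its intensity, each column is a predicted class, and every row is scaled so that it sums to 100%.

```python
from collections import Counter, defaultdict


def row_percentages(records, classes):
    """Build a per-row percentage table.

    records: iterable of (true_label, intensity, predicted_class) tuples.
    classes: ordered list of predicted classes (the table columns).
    Returns a dict mapping (true_label, intensity) to a list of
    percentages, one per entry of `classes`.
    """
    counts = defaultdict(Counter)
    for true_label, intensity, predicted in records:
        counts[(true_label, intensity)][predicted] += 1

    table = {}
    for row_key, row_counts in counts.items():
        total = sum(row_counts.values())
        # Normalize by the row total so each row sums to 100%.
        table[row_key] = [100.0 * row_counts[c] / total for c in classes]
    return table


# Hypothetical labels and predictions, for illustration only.
classes = ["none", "smile", "laugh"]
records = [
    ("smile", "low", "none"),
    ("smile", "low", "smile"),
    ("smile", "low", "smile"),
    ("smile", "low", "smile"),
    ("laugh", "high", "laugh"),
    ("laugh", "high", "smile"),
]
table = row_percentages(records, classes)
for row_key, percentages in table.items():
    print(row_key, percentages)
```

Normalizing per row rather than over the whole table makes rows with different sample counts comparable, which matters when some intensity levels are much rarer than others.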