VEMOCLAP: A video emotion classification web application

Serkan Sulun, Paula Viana, Matthew E. P. Davies

TL;DR

This work improves on our previous work, which uses open-source pretrained models to extract features from video frames and audio and then efficiently fuses those features using multi-head cross-attention, increasing classification accuracy on the Ekman-6 video emotion dataset.

Abstract

We introduce VEMOCLAP: Video EMOtion Classifier using Pretrained features, the first readily available and open-source web application that analyzes the emotional content of any user-provided video. We improve on our previous work, which uses open-source pretrained models that operate on video frames and audio, and then efficiently fuses the resulting pretrained features using multi-head cross-attention. Our approach improves the state-of-the-art classification accuracy on the Ekman-6 video emotion dataset by 4.3%, and we provide an online application in which users can run our model on their own videos or on YouTube videos. We invite readers to try our application at serkansulun.com/app.
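
To make the fusion step concrete, the following is a minimal NumPy sketch of multi-head cross-attention, in which one feature stream supplies the queries and another supplies the keys and values. It is an illustration only, not the VEMOCLAP implementation: the function name, feature dimensions, sequence lengths, number of heads, and random weights are all assumptions chosen for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(query_feats, context_feats, w_q, w_k, w_v, w_o, num_heads):
    """Attend from query_feats (Tq, d) to context_feats (Tk, d).

    All weight matrices are (d, d); d must be divisible by num_heads.
    Returns the fused features (Tq, d) and attention weights (H, Tq, Tk).
    """
    t_q, d = query_feats.shape
    t_k = context_feats.shape[0]
    head_dim = d // num_heads

    # Project, then split the feature dimension into heads: (H, T, head_dim).
    q = (query_feats @ w_q).reshape(t_q, num_heads, head_dim).transpose(1, 0, 2)
    k = (context_feats @ w_k).reshape(t_k, num_heads, head_dim).transpose(1, 0, 2)
    v = (context_feats @ w_v).reshape(t_k, num_heads, head_dim).transpose(1, 0, 2)

    # Scaled dot-product attention per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (H, Tq, Tk)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (H, Tq, head_dim)

    # Concatenate heads and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(t_q, d)
    return merged @ w_o, weights


# Hypothetical inputs: 5 video-frame feature vectors and 7 audio feature vectors.
rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
video_feats = rng.normal(size=(5, d_model))
audio_feats = rng.normal(size=(7, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

fused, attn = multi_head_cross_attention(
    video_feats, audio_feats, w_q, w_k, w_v, w_o, num_heads
)
print(fused.shape, attn.shape)
```

In this sketch the video features act as queries and the audio features as keys and values; which modality plays which role in the actual model is not stated here and is an assumption of the example.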

Paper Structure

This paper contains 10 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Video emotion classification pipeline. Blocks with rounded, dashed outlines represent trained modules. Models shown in parentheses are used conditionally. All other blocks are pretrained feature extractors used in inference mode. Q, K, and V denote the query, key, and value projections.
  • Figure 2: Confusion matrix on the test split of the Ekman-6 dataset, with values normalized over the true labels.
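
As a small illustration of what "normalized over the true labels" means in Figure 2, the sketch below builds a confusion matrix and divides each row (one row per true class) by its total, so every row sums to 1. The helper name, the three class names, and the toy labels are made up for the example and are not results from the paper.

```python
def confusion_matrix_normalized(true_labels, pred_labels, classes):
    """Return a row-normalized confusion matrix as a list of lists.

    Rows correspond to true classes and columns to predicted classes.
    """
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, pred_labels):
        counts[index[t]][index[p]] += 1

    normalized = []
    for row in counts:
        total = sum(row)
        # Guard against classes that never occur as a true label.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Toy data (hypothetical): 2 anger, 4 joy, and 2 sadness samples.
classes = ["anger", "joy", "sadness"]
y_true = ["anger", "anger", "joy", "joy", "joy", "joy", "sadness", "sadness"]
y_pred = ["anger", "joy", "joy", "joy", "joy", "sadness", "sadness", "sadness"]

matrix = confusion_matrix_normalized(y_true, y_pred, classes)
for name, row in zip(classes, matrix):
    print(name, row)
```

With this normalization, the diagonal entry of each row is the per-class recall, which makes classes of different sizes directly comparable.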