Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness

Jiaxing Zhao, Boyuan Sun, Xiang Chen, Xihan Wei

TL;DR

This work tackles dynamic facial expression captioning (DFEC) in videos. It introduces FDA, a DFEC dataset of 5,033 manually annotated clips with over 700,000 tokens, and FaceTrack-MM, a model that uses a dynamic face-tracking module to encode the main character's facial regions in multi-person scenes. FaceTrack-MM combines instruction tuning on FDA with a facial-prior visual encoder and an LLM to generate detailed captions while aiming to limit hallucination. The authors propose the Temporal Event Matching (TEM) metric to jointly assess semantic content and event ordering, and introduce FEC-Bench, a dedicated benchmark for evaluating video MLLMs on DFEC tasks. They report substantial gains over existing models, supported by ablations and qualitative analyses. Together, the framework, data resources, and evaluation tools aim to advance fine-grained facial expression understanding in video-language systems, with potential impact on HCI, entertainment, and affective computing.

Abstract

Facial expression captioning has found widespread application across various domains. Recently, the emergence of video Multimodal Large Language Models (MLLMs) has shown promise in general video understanding tasks. However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token capacity of video MLLMs. To address these issues, this paper introduces a new instruction-following dataset tailored for dynamic facial expression caption. The dataset comprises 5,033 high-quality video clips annotated manually, containing over 700,000 tokens. Its purpose is to improve the capability of video MLLMs to discern subtle facial nuances. Furthermore, we propose FaceTrack-MM, which leverages a limited number of tokens to encode the main character's face. This model demonstrates superior performance in tracking faces and focusing on the facial expressions of the main characters, even in intricate multi-person scenarios. Additionally, we introduce a novel evaluation metric combining event extraction, relation classification, and the longest common subsequence (LCS) algorithm to assess the content consistency and temporal sequence consistency of generated text. Moreover, we present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs in this specific task. All data and source code will be made publicly available.
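
The abstract names the longest common subsequence (LCS) algorithm as the component that checks temporal sequence consistency. The following is a minimal sketch of that idea only, not the authors' metric: it assumes event extraction and relation classification have already mapped both captions to sequences of matched event labels, and the function names, the example labels, and the normalization by reference length are all illustrative assumptions.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def order_consistency(reference: Sequence[str], generated: Sequence[str]) -> float:
    """Fraction of reference events recovered in the same relative order.

    Assumption: dividing by the reference length is one plausible
    normalization; the paper may define its score differently.
    """
    if not reference:
        return 0.0
    return lcs_length(reference, generated) / len(reference)


# Hypothetical event labels, as if already extracted and matched.
reference_events = ["frown", "eyes_widen", "smile"]
generated_events = ["eyes_widen", "frown", "smile"]
score = order_consistency(reference_events, generated_events)
```

In this example the generated caption mentions all three events but swaps the first two, so only two events can be kept in a common order and the score is 2/3. A caption with the same events in the reference order would score 1.0.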

Paper Structure (28 sections, 12 figures, 3 tables, 1 algorithm)

This paper contains 28 sections, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison of annotation styles between different facial expression tasks. Among these, DFER only includes one selected fundamental emotional category, whereas FAUD contains several specific action units. Our DFEC includes detailed facial movements described using natural language.
  • Figure 2: Statistics of the proposed DFEC dataset.
  • Figure 3: Architecture of FaceTrack-MM. FaceTrack-MM uses FaceXFormer [narayan2024facexformer] as the auxiliary facial visual encoder to extract facial features of the main characters and CLIP-ViT-Large [radford2021clip] as the visual encoder. The STC module [damonlpsg2024videollama2] serves as the visual projector to inject temporal information, and Mistral-7B-Instruct [jiang2024mixtralexperts] is the pretrained large language model.
  • Figure 4: Qualitative result comparison. We highlight the content related to emotional and expressive changes in different methods as well as in the ground truth (GT). Our model demonstrates superior capability in capturing changes in facial expressions.
  • Figure 5: Pipeline of the FDA annotation process. We use ChatGPT to preliminarily annotate the emotions and facial changes of the person in the video, then apply manual correction and consolidation to obtain the refined final annotations.
  • ...and 7 more figures