FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO

Fufangchen Zhao, Xuerui Qiu, Linrui Xu, Ming Li, Wenhao Jiang, Jinkai Zheng, Hehe Fan, Jian Gao, Danfeng Yan

TL;DR

FaVChat introduces a dedicated Video-MLLM for fine-grained facial video understanding by employing a three-level, prompt-guided visual encoding pipeline and a data-efficient GRPO training regime. The low-, mid-, and high-level prompting stages preserve textures, motion cues, discriminative regions, and semantic alignment, while two adapters enable adaptive fusion across encoders. The DE-GRPO framework couples per-sample facial rewards with a recurrent data mechanism to maximize learning efficiency from limited data, achieving state-of-the-art performance on emotion recognition, explainable reasoning, and facial recognition tasks with only 10K RL samples. A large, 60K video-summary plus 170K QA dataset underpins progressive three-stage training, yielding strong generalization and detailed, interpretable facial analysis across challenging benchmarks.

Abstract

Multi-modal large language models (MLLMs) have shown strong capability in video understanding but still struggle with fine-grained visual comprehension, as pure visual encoders often lose subtle cues essential for precise reasoning. To address this limitation, we propose FaVChat, a Video-MLLM specifically designed for fine-grained facial understanding. FaVChat introduces a multi-level prompt-guided feature extraction mechanism that progressively captures task-relevant information from three complementary stages: low-level transformer layers for textures and motion, medium-level learnable queries for discriminative regions, and high-level adaptive feature weighting for semantic alignment. These enriched features are dynamically fused and fed into the LLM to enable more accurate fine-grained reasoning. To further enhance the model's ability to capture fine-grained facial attributes and maximize the utility of limited data, we propose Data-Efficient GRPO, a novel data-efficient reinforcement learning (RL) algorithm that maximizes the utility of each training sample through per-instance utility estimation and dynamic lifecycle scheduling. Extensive zero-shot evaluations across emotion recognition, explainable reasoning, and textual expression analysis demonstrate that FaVChat achieves finer-grained understanding, stronger accuracy, and better generalization than existing Video-MLLMs, even when trained with only 10K RL samples.
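
As background for the RL stage: standard GRPO, on which the abstract's data-efficient variant builds, scores each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative normalization only. It does not implement the paper's per-instance utility estimation or lifecycle scheduling, and the function name, the `eps` constant, the choice of population standard deviation, and the example rewards are all illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    actual implementations may differ in these details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one facial-video prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones. A group whose rewards are all identical yields zero advantages, so it contributes no learning signal, which is one reason sample selection matters when RL data is limited.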

Paper Structure

This paper contains 39 sections, 19 equations, 7 figures, 14 tables, 2 algorithms.

Figures (7)

  • Figure 1: (a) Illustration of the proposed FaVChat for fine-grained facial video understanding. For input videos centered on human faces, FaVChat analyzes their fine-grained features based on the given prompts and provides fine-grained responses by integrating the analysis results with the posed questions. In the end-to-end user experience, however, the analysis results on the left side are not visible. (b) Performance of FaVChat on different test sets.
  • Figure 2: Overview of the proposed FaVChat framework. FaVChat augments the original visual encoder with an additional facial encoder (FaceXFormer; Narayan et al., 2024) and incorporates a multi-level prompt-guided feature extraction mechanism, comprising: (i) low-level prompt-query learning for progressive integration of Transformer features, (ii) mid-level prompt-query learning to support learnable queries, and (iii) high-level prompt-query learning to steer weight adaptation. This hierarchical prompting scheme enhances FaVChat’s sensitivity to fine-grained visual cues in videos and strengthens fine-grained alignment between video and textual representations.
  • Figure 3: The performance of $R_\mathrm{FEM}$ in the data recurrent mechanism ablation, where $i$ denotes the number of training iterations.
  • Figure 4: Influence of different inference frame lengths.
  • Figure 5: Training Data Creation Process.
  • ...and 2 more figures