FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO
Fufangchen Zhao, Xuerui Qiu, Linrui Xu, Ming Li, Wenhao Jiang, Jinkai Zheng, Hehe Fan, Jian Gao, Danfeng Yan
TL;DR
FaVChat introduces a dedicated Video-MLLM for fine-grained facial video understanding by employing a three-level, prompt-guided visual encoding pipeline and a data-efficient GRPO training regime. The low-, mid-, and high-level prompting stages preserve textures, motion cues, discriminative regions, and semantic alignment, while two adapters enable adaptive fusion across encoders. The DE-GRPO framework couples per-sample facial rewards with a recurrent data mechanism to maximize learning efficiency from limited data, achieving state-of-the-art performance on emotion recognition, explainable reasoning, and facial recognition tasks with only 10K RL samples. A dataset of 60K video summaries and 170K QA samples underpins progressive three-stage training, yielding strong generalization and detailed, interpretable facial analysis across challenging benchmarks.
Abstract
Multi-modal large language models (MLLMs) have shown strong capability in video understanding but still struggle with fine-grained visual comprehension, as pure visual encoders often lose subtle cues essential for precise reasoning. To address this limitation, we propose FaVChat, a Video-MLLM specifically designed for fine-grained facial understanding. FaVChat introduces a multi-level prompt-guided feature extraction mechanism that progressively captures task-relevant information at three complementary levels: low-level transformer layers for textures and motion, medium-level learnable queries for discriminative regions, and high-level adaptive feature weighting for semantic alignment. These enriched features are dynamically fused and fed into the LLM to enable more accurate fine-grained reasoning. To further enhance the model's ability to capture fine-grained facial attributes when training data is limited, we propose Data-Efficient GRPO (DE-GRPO), a novel reinforcement learning (RL) algorithm that maximizes the utility of each training sample through per-instance utility estimation and dynamic lifecycle scheduling. Extensive zero-shot evaluations across emotion recognition, explainable reasoning, and textual expression analysis demonstrate that FaVChat achieves finer-grained understanding, stronger accuracy, and better generalization than existing Video-MLLMs, even when trained with only 10K RL samples.
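To make the data-efficiency idea concrete, the following is a minimal sketch and not the authors' implementation. The group-relative advantage (each reward centered and scaled within its group of sampled responses) is the standard GRPO normalization. Everything else is an assumption made for illustration: the abstract does not define the utility estimate or the lifecycle rule, so `sample_utility`, `schedule_next_round`, `utility_threshold`, and `max_uses` are hypothetical names, and reward spread is used here only as a stand-in for per-instance utility.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sample_utility(rewards):
    """Hypothetical utility proxy (assumption): reward spread within the group.

    A group whose responses all receive the same reward yields zero advantages,
    so it contributes no learning signal under group-relative normalization.
    """
    return pstdev(rewards)


def schedule_next_round(pool, utility_threshold=0.1, max_uses=3):
    """Hypothetical lifecycle rule (assumption): recycle a sample only while it
    is still informative and has not been used max_uses times."""
    kept = []
    for sample in pool:
        informative = sample_utility(sample["rewards"]) > utility_threshold
        if sample["uses"] < max_uses and informative:
            kept.append(sample["id"])
    return kept


# Toy pool: each entry holds the rewards of one group of sampled responses.
pool = [
    {"id": "clip_a", "rewards": [1.0, 0.0, 1.0, 0.0], "uses": 1},  # mixed outcomes
    {"id": "clip_b", "rewards": [1.0, 1.0, 1.0, 1.0], "uses": 1},  # no spread
    {"id": "clip_c", "rewards": [0.2, 0.9, 0.4, 0.7], "uses": 3},  # usage budget spent
]

advantages = group_relative_advantages(pool[0]["rewards"])
next_round = schedule_next_round(pool)
print(next_round)
```

In this toy pool only `clip_a` is recycled: `clip_b` has identical rewards across its group and therefore zero advantages, and `clip_c` has exhausted its assumed usage budget.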
