
Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks

Yufei Wang, Haixu Liu, Tianxiang Xu, Chuancheng Shi, Hongsheng Xing

TL;DR

The paper tackles hidden emotion understanding in videos under data scarcity and class imbalance by introducing a multimodal, weakly supervised framework that uses a vision-language model for pseudo-labeling. It combines frame-level portrait features from DINOv2, OpenPose-based keypoint streams modeled with Transformer backbones (including an efficient MLP variant), and BERT-encoded textual reasoning generated via CoT+Reflection prompts to guide emotion inference, with pseudo-labels augmenting the training set in a two-stage regime. Key findings show that an MLP-based keypoint backbone can match or exceed GCN-based counterparts while reducing computation, and that the proposed weakly supervised curriculum yields state-of-the-art accuracy of over 0.69 on the iMiGUE tennis-interview dataset. The approach demonstrates the practical viability of large-model pseudo-labeling for weak supervision in multimodal video tasks and provides a foundation for further improvements in cross-modal short-horizon reasoning and 4D facial analysis.

Abstract

To tackle the automatic recognition of "concealed emotions" in videos, this paper proposes a multimodal weak-supervision framework and achieves state-of-the-art results on the iMiGUE tennis-interview dataset. First, YOLO 11x detects and crops human portraits frame by frame, and DINOv2-Base extracts visual features from the cropped regions. Next, by integrating Chain-of-Thought and Reflection prompting (CoT + Reflection), Gemini 2.5 Pro automatically generates pseudo-labels and reasoning texts that serve as weak supervision for downstream models. Subsequently, OpenPose produces 137-dimensional key-point sequences, augmented with inter-frame offset features; the usual graph neural network backbone is simplified to an MLP to efficiently model the spatiotemporal relationships of the three key-point streams. An ultra-long-sequence Transformer independently encodes both the image and key-point sequences, and their representations are concatenated with BERT-encoded interview transcripts. Each modality is first pre-trained in isolation, then fine-tuned jointly, with pseudo-labeled samples merged into the training set for further gains. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts accuracy from under 0.6 in prior work to over 0.69, establishing a new public benchmark. The study also validates that an "MLP-ified" key-point backbone can match, or even surpass, GCN-based counterparts in this task.
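The inter-frame offset augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function name `add_offset_features` is hypothetical, each frame is treated as a flat list of key-point coordinates, the offset is taken as the difference from the previous frame, and the first frame receives zero offsets.

```python
from typing import List

# One frame = flattened key-point coordinates (layout is an assumption).
Frame = List[float]


def add_offset_features(frames: List[Frame]) -> List[Frame]:
    """Append per-coordinate offsets from the previous frame to each frame.

    Hypothetical helper: the paper's actual offset definition may differ.
    The first frame has no predecessor, so its offsets are set to zero.
    """
    augmented: List[Frame] = []
    previous = None
    for frame in frames:
        if previous is None:
            offsets = [0.0] * len(frame)
        else:
            if len(frame) != len(previous):
                raise ValueError("all frames must have the same length")
            # Offset = current coordinate minus previous coordinate.
            offsets = [cur - prev for cur, prev in zip(frame, previous)]
        augmented.append(list(frame) + offsets)
        previous = frame
    return augmented


# Toy sequence: three frames with a single (x, y) key point each.
toy_sequence = [[0.0, 0.0], [1.0, 2.0], [1.5, 1.0]]
augmented_sequence = add_offset_features(toy_sequence)
```

Under these assumptions, each frame's feature vector doubles in length: the original coordinates followed by their frame-to-frame displacement, which gives a per-frame model such as an MLP direct access to short-range motion.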

Paper Structure (15 sections, 3 figures, 3 tables)

This paper contains 15 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Visualization of OpenPose Keypoint Connections
  • Figure 2: Visualization of Frame Distribution
  • Figure 3: Visualization of Model Structure