Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization

Cailing Han, Zhangbin Li, Jinxing Zhou, Wei Qian, Jingjing Hu, Yanghao Zhou, Zhangling Duan, Dan Guo

Abstract

Point-level weakly-supervised temporal sentiment localization (P-WTSL) aims to detect sentiment-relevant segments in untrimmed multimodal videos using timestamp sentiment annotations, which greatly reduces the need for costly frame-level labeling. To tackle the imprecise sentiment boundaries that arise in P-WTSL, we propose the Face-guided Sentiment Boundary Enhancement Network (FSENet), a unified framework that leverages fine-grained facial features to guide sentiment localization. First, we introduce the Face-guided Sentiment Discovery (FSD) module, which integrates facial features into multimodal interaction via dual-branch modeling to uncover effective sentiment stimulus clues. Second, we propose the Point-aware Sentiment Semantics Contrast (PSSC) strategy, which uses contrastive learning to discriminate the sentiment semantics of frame-level candidate points near annotation points, thereby enhancing the model's ability to recognize sentiment boundaries. Finally, we design the Boundary-aware Sentiment Pseudo-label Generation (BSPG) approach to convert sparse point annotations into temporally smooth supervisory pseudo-labels. Extensive experiments and visualizations on the benchmark demonstrate the effectiveness of our framework, which achieves state-of-the-art performance under full supervision, video-level weak supervision, and point-level weak supervision, showcasing the strong generalization ability of FSENet across different annotation settings.

Paper Structure (18 sections, 12 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Illustration of the task and motivation. (a) Given point-level sentiment annotations in an untrimmed video, the model must perform sentiment classification and boundary localization. (b) We explore a facial-centric method to guide the discovery of sentiment clues. (c) To alleviate sentiment boundary jitter and discontinuity under weak supervision, we propose a novel pseudo-label smoothing strategy.
  • Figure 2: Overview of our FSENet framework. (a) The FSD module introduces facial features into the temporal sentiment localization task through two branches to advance the discovery of sentiment clues: one focuses on facial-centric interaction, while the other employs a global sentiment perception weighting score. (b) PSSC identifies semantically similar and dissimilar points along the temporal axis to enhance point-level sentiment discrimination through contrastive learning. (c) BSPG transforms point-level sentiment annotations into segment-level sentiment boundary pseudo-labels using threshold filtering and a step-by-step boundary-aware scheme, in which sentiment scores decay progressively according to the parameters $\beta$ and $w$.
  • Figure 3: Visualization of the predicted results for temporal sentiment localization. 'Pink' segments denote positive sentiment, 'blue' segments denote negative sentiment, and 'gray' segments denote no valid location output. $\triangle$ marks methods that incorporate facial features.
  • Figure 4: t-SNE visualization comparing our FSENet with SOTA. Each point represents a segment-level feature (‘p/n/n-sent’ denote the positive, negative, and non-sentiment categories, respectively).
  • Figure 5: Ablation study of Top-K parameter $k$ in PSSC on mAP performance.
  • ...and 2 more figures
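
The Figure 2 caption describes BSPG as turning point annotations into segment-level pseudo-labels through threshold filtering and a stepwise decay governed by $\beta$ and $w$. The page gives no formula for this, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. The function name `expand_point_labels`, the rule of growing outward while the model score stays above `threshold`, and the rule of multiplying the label by `beta` once every `w` frames are all assumptions made for illustration.

```python
def expand_point_labels(scores, points, threshold=0.5, beta=0.8, w=2):
    """Spread sparse point annotations into soft frame-level pseudo-labels.

    Hypothetical sketch: every rule here is an assumption, not the paper's method.
    scores    : per-frame sentiment confidences predicted by a model, in [0, 1]
    points    : indices of the annotated frames
    threshold : frames scoring below this value stop the expansion (assumed)
    beta      : decay factor in (0, 1] applied as we move away from a point (assumed)
    w         : number of frames sharing the same decay level (assumed)
    """
    n = len(scores)
    labels = [0.0] * n
    for p in points:
        labels[p] = 1.0  # the annotated frame keeps full confidence
        for direction in (-1, 1):  # walk left, then right
            step = 1
            t = p + direction
            # Threshold filtering: keep growing only over confident frames.
            while 0 <= t < n and scores[t] >= threshold:
                # Stepwise decay: exponent is ceil(step / w).
                weight = beta ** ((step + w - 1) // w)
                # If two annotated points overlap, keep the stronger label.
                labels[t] = max(labels[t], weight)
                step += 1
                t += direction
    return labels


if __name__ == "__main__":
    frame_scores = [0.1, 0.6, 0.7, 0.9, 0.8, 0.6, 0.2, 0.1]
    print(expand_point_labels(frame_scores, points=[3], beta=0.5, w=1))
```

In this sketch the label stays at 1 on the annotated frame and shrinks toward the segment edges, so supervision is soft near the boundary instead of jumping abruptly from 1 to 0. Whether the paper decays the label, the score, or both is not stated on this page.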