Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer

Xinpeng Li, Teng Wang, Jian Zhao, Shuyi Mao, Jinbao Wang, Feng Zheng, Xiaojiang Peng, Xuelong Li

TL;DR

The paper tackles context-aware emotion recognition by introducing a single-stage framework built on a Decoupled Subject-Context Transformer (DSCT). It jointly localizes subjects and classifies their emotions using decoupled subject and context queries that are fused early via spatial-semantic relations, enabling fine-grained interaction between subject-centric cues and contextual information. Through experiments on CAER-S and EMOTIC, the approach outperforms two-stage baselines while using fewer parameters, with ablations underscoring the importance of early fusion and DSCT components. The work advances end-to-end emotion recognition by integrating localization and classification into a unified transformer-based framework that effectively leverages context.

Abstract

Emotion recognition aims to discern the emotional state of subjects within an image, relying on subject-centric and contextual visual cues. Current approaches typically follow a two-stage pipeline: first localize subjects with off-the-shelf detectors, then classify emotions through the late fusion of subject and context features. However, this complicated paradigm suffers from disjoint training stages and limited interaction between fine-grained subject-context elements. To address these challenges, we present a single-stage emotion recognition approach that employs a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification. Rather than compartmentalizing training stages, we jointly leverage box and emotion signals as supervision to enrich subject-centric feature learning. Furthermore, we introduce DSCT to facilitate interactions between fine-grained subject-context cues in a decouple-then-fuse manner. The decoupled query tokens (subject queries and context queries) gradually intertwine across layers within DSCT, during which spatial and semantic relations are exploited and aggregated. We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC. Our approach surpasses two-stage alternatives with fewer parameters, achieving a 3.39% accuracy improvement on CAER-S and a 6.46% average precision gain on EMOTIC.
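
The decouple-then-fuse idea can be illustrated with a small toy sketch. The abstract does not give the DSCT layer equations, so everything below is an assumption made for illustration: the function name `fuse_subject_context`, the `spatial_weight` parameter, the use of a dot product as the semantic relation, and the use of Euclidean distance between reference points as the spatial relation are all hypothetical. The sketch only updates subject queries and keeps context queries fixed, which is a simplification.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_subject_context(subj, ctx, subj_ref, ctx_ref, spatial_weight=1.0):
    """Toy fusion step (hypothetical, not the paper's exact layer).

    subj:     list of subject query vectors, each of length d
    ctx:      list of context query vectors, each of length d
    subj_ref: list of (x, y) reference points for subject queries
    ctx_ref:  list of (x, y) reference points for context queries
    """
    d = len(subj[0])
    fused, all_weights = [], []
    for q, q_ref in zip(subj, subj_ref):
        scores = []
        for k, k_ref in zip(ctx, ctx_ref):
            # Assumed semantic relation: scaled dot-product similarity.
            semantic = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            # Assumed spatial relation: distance between reference points.
            distance = math.dist(q_ref, k_ref)
            scores.append(semantic - spatial_weight * distance)
        weights = softmax(scores)
        # Aggregate context features, then add them to the subject query.
        aggregated = [sum(w * k[i] for w, k in zip(weights, ctx)) for i in range(d)]
        fused.append([a + b for a, b in zip(q, aggregated)])
        all_weights.append(weights)
    return fused, all_weights


def run_layers(subj, ctx, subj_ref, ctx_ref, num_layers=3):
    """Apply the toy fusion step repeatedly, mimicking gradual intertwining."""
    for _ in range(num_layers):
        subj, _ = fuse_subject_context(subj, ctx, subj_ref, ctx_ref)
    return subj
```

In this sketch, a context query that is both semantically similar to a subject query and spatially close to it receives a larger aggregation weight, which is one plausible reading of "spatial and semantic relations are exploited and aggregated".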

Paper Structure (16 sections, 9 equations, 13 figures, 13 tables)

This paper contains 16 sections, 9 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Motivation of the single-stage framework. Contexts play a vital and nuanced role in emotion recognition. In (a) and (b), prior methods comprise two stages: localization of subject regions without or with context regions (blue and gold rectangles), followed by emotion classification without or with late fusion. In (c), we propose a single-stage framework for simultaneous localization and classification, using a decoupled subject-context transformer with early fusion. Our method notices useful and subtle emotional cues (blue and gold triangles).
  • Figure 2: Performance vs. model efficiency of different methods on EMOTIC (red) and CAER-S (blue). Our proposed single-stage framework (star) achieves state-of-the-art performance with fewer parameters than prior two-stage methods (circle).
  • Figure 3: Overall architecture of our single-stage emotion recognition approach for simultaneous subject localization and emotion classification, employing a Decoupled Subject-Context Transformer (DSCT) with early subject-context fusion.
  • Figure 4: Illustration of the DSCT. The left panel shows the reference points of the subject queries (orange diamond) and context queries (green circle). The right panel describes the spatial-semantic relational aggregation.
  • Figure 5: Visualization of the model outputs on EMOTIC.
  • ...and 8 more figures