Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition

Wen Yin, Yong Wang, Guiduo Duan, Dongyang Zhang, Xin Hu, Yuan-Fang Li, Tao He

TL;DR

This work introduces the Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER) task and presents the Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP) framework to address it. KCDP has two components: KADAP, which aligns domain-agnostic knowledge via textual cues and cross-attention within a diffusion perceptual backbone, and CLIEA, which generates high-quality pseudo-labels for the target domain using counterfactual language-image alignment. The approach uses LoRA-based fine-tuning for efficient parameter updates and a Mixture of Experts to fuse multimodal cues, achieving state-of-the-art results across domain adaptation and universal cross-domain settings, including notable gains over the TGCA-PVT and EmoVIT baselines. The work also provides a new benchmark for UCDVER and shows that combining domain-agnostic knowledge with counterfactual reasoning improves cross-domain emotion understanding.

Abstract

Visual Emotion Recognition (VER) is a critical yet challenging task that aims to infer the emotional states of individuals from visual cues. However, existing works focus on single domains, e.g., realistic images or stickers, which limits the cross-domain generalizability of VER models. To fill this gap, we introduce the Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER) task, which aims to generalize visual emotion recognition from a source domain (e.g., realistic images) to a low-resource target domain (e.g., stickers) in an unsupervised manner. Compared to conventional unsupervised domain adaptation problems, UCDVER presents two key challenges: significant variability in emotional expression and an affective distribution shift. To mitigate these issues, we propose the Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP) framework. Specifically, KCDP leverages a VLM to align emotional representations in a shared knowledge space and guides diffusion models toward improved visual affective perception. Furthermore, a Counterfactual-Enhanced Language-image Emotional Alignment (CLIEA) method generates high-quality pseudo-labels for the target domain. Extensive experiments demonstrate that our model surpasses SOTA models in both perceptibility and generalization, e.g., a 12% improvement over the SOTA VER model TGCA-PVT. The project page is at https://yinwen2019.github.io/ucdver.
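
The abstract does not detail how CLIEA selects pseudo-labels. Purely as an illustration of language-image pseudo-labeling in general, the sketch below assigns each target-domain image the emotion whose text embedding it is most similar to, and keeps only confident predictions. The function name `pseudo_labels`, the temperature of 0.07, the 0.6 threshold, and the toy embeddings are assumptions made for this example, not values from the paper, and the counterfactual component of CLIEA is not modeled here.

```python
import numpy as np

def pseudo_labels(image_emb, text_emb, temperature=0.07, threshold=0.6):
    """Hypothetical helper: label images by cosine similarity to class text embeddings.

    image_emb: (n_images, d) array of image embeddings.
    text_emb:  (n_classes, d) array, one text embedding per emotion class.
    Returns (labels, confidences, keep_mask).
    """
    # L2-normalize so the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Temperature-scaled similarities, then a numerically stable softmax.
    logits = img @ txt.T / temperature
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    labels = probs.argmax(axis=1)
    confidences = probs.max(axis=1)
    # Keep only predictions whose confidence reaches the (assumed) threshold.
    return labels, confidences, confidences >= threshold

# Toy data: three emotion classes with orthogonal text embeddings.
text_emb = np.eye(3)
image_emb = np.array([
    [1.0, 0.0, 0.0],   # clearly class 0
    [0.0, 1.0, 0.0],   # clearly class 1
    [1.0, 1.0, 1.0],   # equally similar to all classes, so ambiguous
])
labels, confidences, keep = pseudo_labels(image_emb, text_emb)
print(keep.tolist())  # [True, True, False]
```

In this toy case the ambiguous third image is discarded, which is the general motivation for filtering: noisy pseudo-labels on the target domain would otherwise be treated as ground truth during adaptation.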

Paper Structure

This paper contains 14 sections, 13 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of the two significant challenges of UCDVER in (a) and (b). Our proposed Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP), shown in (c), aligns different domains in a shared knowledge space, which guides the diffusion model to bridge the domain gap.
  • Figure 2: Overview of KCDP, comprising KADAP (green box), CLIEA (pink box) and Optimization (gray box). Specifically, KADAP uses BLIP to obtain captions and then employs a knowledge parser to extract knowledge triples. We leverage the CLIP text encoder to encode the knowledge, whose embeddings guide the denoising network via a cross-attention component. The final visual and knowledge representations are fused to classify emotions via a MoE-based predictor. CLIEA constructs counterfactual samples using knowledge features from causal graphs. A multi-head attention mechanism and a mapping network then map linguistic and visual features to the emotional space for alignment, yielding high-quality pseudo-labels for the target domain.
  • Figure 3: Details of our CLIEA causal graph. (a) Factual causality $Y_{v,k,p}(X)$. (b) Counterfactual causality $Y_{v,k,p^*}(X)$.
  • Figure 4: Effectiveness of different fine-tuning strategies on the E→S and S→E tasks. Only results on the target domain are reported.
  • Figure 5: t-SNE visualization of cross-domain embeddings, compared with baseline models, on the E→S task. Red and blue points represent the embedded representations of source-domain and target-domain samples, respectively.
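
The Figure 2 caption states that knowledge embeddings guide the denoising network via a cross-attention component. As a minimal sketch of that general mechanism, and not of the authors' implementation, the example below runs single-head scaled dot-product cross-attention in NumPy, with visual tokens as queries and knowledge embeddings as keys and values. All shapes, the random projection matrices, and the names `cross_attention`, `w_q`, `w_k`, and `w_v` are hypothetical; the CLIP text encoder and the diffusion backbone are replaced by random arrays.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, knowledge, w_q, w_k, w_v):
    """Single-head cross-attention: visual tokens attend to knowledge tokens.

    visual:    (n_vis, d_vis) visual feature tokens (queries).
    knowledge: (n_txt, d_txt) knowledge text embeddings (keys and values).
    Returns (attended_features, attention_weights).
    """
    q = visual @ w_q        # (n_vis, d)
    k = knowledge @ w_k     # (n_txt, d)
    v = knowledge @ w_v     # (n_txt, d)
    # Scaled dot-product scores between every visual and knowledge token.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)   # each row sums to 1
    return attn @ v, attn

# Stand-in data (assumed sizes): 16 visual tokens, 5 knowledge tokens.
rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 32))
knowledge = rng.normal(size=(5, 24))
w_q = rng.normal(size=(32, 8))
w_k = rng.normal(size=(24, 8))
w_v = rng.normal(size=(24, 8))

out, attn = cross_attention(visual, knowledge, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (16, 8) (16, 5)
```

Each output row is a weighted mixture of knowledge values, so every visual token receives a summary of the textual knowledge most relevant to it. This is the sense in which text can condition a visual network through cross-attention.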