AULLM++: Structural Reasoning with Large Language Models for Micro-Expression Recognition

Zhishu Liu, Kaishen Yuan, Bo Zhao, Hui Ma, Zitong Yu

TL;DR

A reasoning-oriented framework leveraging Large Language Models (LLMs), which injects visual features into textual prompts as actionable semantic premises to guide inference and introduces Counterfactual Consistency Regularization (CCR) to construct counterfactual samples, enhancing the model's generalization.

Abstract

Micro-expression Action Unit (AU) detection identifies localized AUs from subtle facial muscle activations, providing a foundation for decoding affective cues. Previous methods face three key limitations: (1) heavy reliance on low-density visual information, which leaves discriminative evidence vulnerable to background noise; (2) coarse-grained feature processing that is mismatched with the need for fine-grained representations; and (3) neglect of inter-AU correlations, which restricts the parsing of complex expression patterns. We propose AULLM++, a reasoning-oriented framework built on Large Language Models (LLMs) that injects visual features into textual prompts as actionable semantic premises to guide inference. It decomposes AU prediction into three stages: evidence construction, structure modeling, and deduction-based prediction. Specifically, a Multi-Granularity Evidence-Enhanced Fusion Projector (MGE-EFP) fuses mid-level texture cues with high-level semantics, distilling them into a compact Content Token (CT). Furthermore, inspired by the correspondence between micro- and macro-expression AUs, we encode AU relationships as a sparse structural prior and learn interaction strengths via a Relation-Aware AU Graph Neural Network (R-AUGNN), producing an Instruction Token (IT). We then fuse CT and IT into a structured textual prompt and introduce Counterfactual Consistency Regularization (CCR), which constructs counterfactual samples to enhance the model's generalization. Extensive experiments demonstrate that AULLM++ achieves state-of-the-art performance on standard benchmarks and exhibits superior cross-domain generalization.
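
The structure-modeling stage described above (a sparse AU-relation prior whose interaction strengths are learned) can be illustrated with a minimal sketch. This is not the authors' R-AUGNN: the AU list, the edge set `PRIOR_EDGES`, and the helper names `build_mask`, `masked_softmax`, and `propagate` are all hypothetical, and the strengths are passed in as plain numbers rather than trained. The sketch only shows the general idea of restricting message passing between AU nodes to the pairs a prior allows.

```python
import math

# Hypothetical sparse prior: AU pairs assumed to be related (illustrative only,
# not the relation set used in the paper).
AUS = ["AU4", "AU7", "AU15", "AU17"]
PRIOR_EDGES = {("AU4", "AU7"), ("AU15", "AU17"), ("AU4", "AU15")}


def build_mask(aus, edges):
    """Symmetric 0/1 adjacency with self-loops; zeros mark disallowed interactions."""
    n = len(aus)
    idx = {au: i for i, au in enumerate(aus)}
    mask = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        mask[idx[a]][idx[b]] = 1
        mask[idx[b]][idx[a]] = 1
    return mask


def masked_softmax(scores, mask_row):
    """Softmax over allowed neighbours only; masked entries get weight 0."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


def propagate(features, strengths, mask):
    """One message-passing step: each AU node takes a weighted average of the
    features of itself and its prior-allowed neighbours. `strengths` stands in
    for the interaction scores a trained model would learn."""
    n, dim = len(features), len(features[0])
    out = []
    for i in range(n):
        weights = masked_softmax(strengths[i], mask[i])
        out.append([sum(weights[j] * features[j][d] for j in range(n))
                    for d in range(dim)])
    return out


if __name__ == "__main__":
    mask = build_mask(AUS, PRIOR_EDGES)
    feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
    strengths = [[0.0] * len(AUS) for _ in AUS]  # untrained: uniform over neighbours
    for au, vec in zip(AUS, propagate(feats, strengths, mask)):
        print(au, vec)
```

With all strengths equal, each node simply averages over itself and its allowed neighbours; in a trained model the strengths would differ per pair, while the mask would keep unrelated AUs from exchanging information.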

Paper Structure (15 sections, 9 equations, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: (a) Micro-expression AU detection is challenged by subtle intensity and visually similar AU mixtures (e.g., AU4+7 vs. AU4+15+17), which leads to ambiguity for conventional detectors. (b) We integrate multi-granularity visual evidence with a structured AU-relation prior and leverage LLM-based reasoning to produce coherent AU predictions, yielding stronger cross-dataset generalization.
  • Figure 2: Overall architecture of AULLM++. The model constructs a compact visual evidence token $T_v$ and a psychology-guided AU-structure instruction token $\tau_{au}$, injects them into an LLM for AU reasoning and classification, and applies CCR during training for robustness.
  • Figure 3: t-SNE visualization of high-level features across CASME II, SAMM, and 4DME-Micro datasets. (a) The baseline LED-SSSNet exhibits severe domain shifts with isolated clusters. (b) AULLM++ demonstrates significantly improved domain alignment and feature entanglement.
  • Figure 4: Visualization of feature evolution.