DEFT-LLM: Disentangled Expert Feature Tuning for Micro-Expression Recognition

Ren Zhang, Huilai Li, Chao Qi, Guoliang Xu, Tianyu Zhou, Wei Wei, Jianqin Yin

TL;DR

DEFT-LLM tackles micro-expression recognition (MER) by disentangling appearance and motion cues with three expert encoders and by supervising the model with Uni-MER, a motion-grounded instruction dataset. It couples a structured multimodal prompting framework with a hybrid discriminative-generative objective to produce interpretable, structured outputs in the form of (C, E, R) triples alongside robust predictions. Results on challenging MER benchmarks show state-of-the-art accuracy and strong cross-dataset generalization, and ablations confirm the contribution of each architectural component. The work offers a principled way to ground LLMs in physical motion signals for fine-grained emotion and Action Unit (AU) recognition, improving both performance and interpretability.

Abstract

Micro-expression recognition (MER) is crucial for inferring genuine emotion. Applying a multimodal large language model (MLLM) to this task enables spatio-temporal analysis of facial motion and provides interpretable descriptions. However, two core challenges remain: (1) the entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion; (2) textual labels in existing MER datasets do not fully correspond to the underlying facial muscle movements, creating a semantic gap between text supervision and physical motion. To address these issues, we propose DEFT-LLM, which achieves motion-semantic alignment through multi-expert disentanglement. We first introduce Uni-MER, a motion-driven instruction dataset designed to align text with local facial motion. Its construction leverages dual constraints from optical flow and Action Unit (AU) labels to ensure spatio-temporal consistency and a reasonable correspondence to the movements. We then design an architecture with three experts that decouples facial dynamics into independent and interpretable representations (structure, dynamic textures, and motion semantics). By integrating the instruction-aligned knowledge from Uni-MER into DEFT-LLM, our method injects effective physical priors for micro-expressions while also leveraging the cross-modal reasoning ability of large language models, thus enabling precise capture of subtle emotional cues. Experiments on multiple challenging MER benchmarks demonstrate state-of-the-art performance, as well as a particular advantage in interpretable modeling of local facial motion.
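
The three-expert design can be illustrated with a small shape-level sketch. Following the architecture overview in Figure 3, each expert feature is projected into prefix tokens that are prepended to the text prompt embeddings. The code below is not the authors' implementation: the feature sizes, the number of tokens per expert, the model width, and the random linear projectors are all assumptions made for illustration, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not from the paper).
EXPERT_DIMS = {"struc": 256, "temp": 512, "sem": 384}
TOKENS_PER_EXPERT = 4
LLM_DIM = 64  # a real LLM embedding width is much larger


def make_projector(in_dim, n_tokens, out_dim):
    """Random linear map standing in for a learned projector."""
    return rng.standard_normal((in_dim, n_tokens * out_dim)) / np.sqrt(in_dim)


def project(u, weight, n_tokens, out_dim):
    """Map one expert feature vector to n_tokens prefix tokens."""
    return (u @ weight).reshape(n_tokens, out_dim)


def build_input_sequence(expert_feats, projectors, text_embeds):
    """Concatenate expert prefix tokens (T_deft) in front of text tokens (T_tex)."""
    prefix = [
        project(expert_feats[name], projectors[name], TOKENS_PER_EXPERT, LLM_DIM)
        for name in EXPERT_DIMS
    ]
    t_deft = np.concatenate(prefix, axis=0)
    return np.concatenate([t_deft, text_embeds], axis=0)


projectors = {
    name: make_projector(dim, TOKENS_PER_EXPERT, LLM_DIM)
    for name, dim in EXPERT_DIMS.items()
}
# Stand-ins for the outputs of the three frozen expert encoders.
expert_feats = {name: rng.standard_normal(dim) for name, dim in EXPERT_DIMS.items()}
# Stand-in for the embedded text prompt (10 tokens).
text_embeds = rng.standard_normal((10, LLM_DIM))

sequence = build_input_sequence(expert_feats, projectors, text_embeds)
print(sequence.shape)  # (22, 64): 3 experts x 4 prefix tokens + 10 text tokens
```

Keeping one projector per expert means each stream occupies its own block of prefix positions, which is one simple way to keep the structure, temporal, and motion-semantic cues separate at the input of the language model.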

Paper Structure

This paper contains 89 sections, 8 equations, 6 figures, 16 tables, 2 algorithms.

Figures (6)

  • Figure 1: Comparison of emotion reasoning. (a) A standard MLLM misinterprets the emotion as sadness because its features are entangled. (b) Our DEFT-LLM employs multiple experts to encode distinct key cues, yielding disentangled features; by learning the correspondence between these cues and emotions, it correctly infers disgust.
  • Figure 2: The Uni-MER pipeline: Video is processed into quantified, region-specific Motion Evidence ($\mathbf{E}$). A bidirectional correspondence module (center) verifies this evidence against Ground Truth AUs ($\mathcal{A}$) to generate a grounded Rationale ($\mathcal{R}$), yielding the final structured data triple ($\mathcal{C}, \mathbf{E}, \mathcal{R}$).
  • Figure 3: An overview of the DEFT-LLM architecture. Three parallel, frozen expert encoders extract features for structure ($u_{struc}$), temporal dynamics ($u_{temp}$), and motion semantics ($u_{sem}$) from different facial cues. These features are projected into expert prefix tokens ($\mathcal{T}_{deft}$) and prepended to the text prompt ($\mathcal{T}_{tex}$). The combined sequence is then processed by a LoRA-tuned LLaMA 3.1 to generate the structured output.
  • Figure 4: Visualization of the 17 Defined Facial ROIs. This figure illustrates the 17 regions, which are segmented by applying convex hulls to groups of facial landmarks. The selection of these landmarks is based on prior anatomical knowledge. The final ROIs and their corresponding defining landmarks are shown overlaid on a facial image.
  • Figure 5: Effect of Motion Compensation. (a) Raw optical flow visualized in HSV, where a uniform color indicates dominant global head motion (e.g., upward-left). (b) Compensated flow for the same frame. The global motion is effectively nullified, revealing the subtle, localized motions corresponding to a facial expression (e.g., contraction around the eyes and mouth corners) as distinct color patterns against a static (black) background.
  • ...and 1 more figure
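
The motion compensation described in the Figure 5 caption can be sketched on synthetic data. This section does not specify how the paper estimates global head motion, so the code below swaps in a simple stand-in: it assumes the head motion is a pure translation and removes it by subtracting the per-frame median flow vector. The function name, the array layout `(H, W, 2)`, and the sign convention (negative x is left, negative y is up) are assumptions for illustration.

```python
import numpy as np


def compensate_global_motion(flow):
    """Subtract the per-frame median flow vector from an (H, W, 2) flow field.

    Assumption: global head motion is a pure translation, so the median
    over all pixels estimates it robustly when the expression occupies
    only a small part of the face.
    """
    global_motion = np.median(flow.reshape(-1, 2), axis=0)
    return flow - global_motion, global_motion


# Synthetic frame: uniform upward-left head motion plus one small local patch.
flow = np.zeros((64, 64, 2))
flow[..., 0] = -1.5  # x component (left, under the assumed convention)
flow[..., 1] = -2.0  # y component (up, under the assumed convention)
flow[40:48, 20:30, 0] += 0.6  # subtle local motion, e.g. near a mouth corner

residual, global_motion = compensate_global_motion(flow)
magnitude = np.linalg.norm(residual, axis=-1)

print(global_motion)                      # estimated head translation
print(np.count_nonzero(magnitude > 1e-9))  # pixels with residual motion
```

After compensation only the small patch carries non-zero flow, which mirrors the static (black) background in panel (b). A translation-only model would not capture head rotation or scale changes; handling those would need a richer global motion model, such as an affine fit.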