Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild

Yigui Feng, Qinglin Wang, Haotian Mo, Yang Liu, Ke Liu, Gencheng Liu, Xinhai Chen, Siqi Shen, Songzhu Mei, Jie Liu

TL;DR

This work tackles the problem of generating psychologically grounded analyses from in-the-wild conversations by addressing Articulatory-Affective Ambiguity and the lack of robust evaluation. It introduces MIND, a hierarchical visual encoder that disentangles speech articulation from emotion, and PRISM, an automated, multi-dimensional evaluation framework, together with ConvoInsight-DB, a large, expert-annotated dataset. The approach yields substantial gains in micro-expression detection and psychological reasoning, with ablations confirming the critical role of the Status Judgment module and the multi-level fusion design. Collectively, the ecosystem enables more reliable, visually grounded psychological inference and paves the way for future improvements including audio integration and ethical safeguards.

Abstract

Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce the Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder whose Status Judgment module algorithmically suppresses ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code is publicly available.
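
The idea of suppressing lip features according to their temporal variance can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `status_judgment_gate`, the hard threshold, the scalar gate, and the assumption that high variance indicates speech (and therefore ambiguity) are all hypothetical choices made for illustration.

```python
import numpy as np

def status_judgment_gate(lip_feats, threshold=0.5):
    """Hypothetical variance-based gate for lip-region features.

    lip_feats: array of shape (T, D), one D-dim feature per frame.
    threshold: assumed cutoff on the mean temporal variance.
    Returns (gated_feats, is_speaking).
    """
    # Variance of each feature dimension over time, averaged to one scalar.
    temporal_var = lip_feats.var(axis=0).mean()
    # Assumption: strong temporal variation means the lips are articulating.
    is_speaking = bool(temporal_var > threshold)
    # Suppress the lip features entirely when speech is detected.
    gate = 0.0 if is_speaking else 1.0
    return lip_feats * gate, is_speaking

# Static lips: constant features, zero temporal variance.
still = np.ones((16, 8))
# Moving lips: features alternate between 0 and 2 on every frame.
talking = np.tile(np.array([[0.0], [2.0]]), (8, 8))

still_out, still_flag = status_judgment_gate(still)
talk_out, talk_flag = status_judgment_gate(talking)
```

In this sketch the static clip passes through unchanged, while the alternating clip is zeroed out. A learned or soft gate would be a natural alternative to the hard threshold used here.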

Paper Structure

This paper contains 29 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The motivation behind MIND (zoom in for detailed Q&A): (a) Current state-of-the-art large multimodal models (LMMs) for emotion recognition are inaccurate for video-based emotion recognition, unable to detect and analyze micro-expressions, and have limited ability to infer a person's psychological activities. (b) In contrast, MIND effectively addresses these limitations, enabling efficient psychological analysis and profiling of individuals. Incorrect outputs are marked in red; correct outputs are marked in green.
  • Figure 2: An overview of the MIND training process and model architecture. The FanEncoder module decouples expression features from the video, while the MicroExpressionEncoder extracts micro-expression features. The MultiLevelExpressionEncoder integrates micro-expression emotion features with macro-expression features. The detailed structures of the MicroExpressionEncoder and MultiLevelExpressionEncoder are shown on the right of the figure.
  • Figure 3: Main statistics of ConvoInsight-DB dataset.
  • Figure 4: Balanced performance of large multimodal models (LMMs). MIND-LLM-8B demonstrates superior performance across all evaluation dimensions. Evaluation details are shown in Table 3.