Multi-modal expressive personality recognition in data non-ideal audiovisual based on multi-scale feature enhancement and modal augment
Weixuan Kong, Jinpeng Yu, Zijun Li, Hanwei Liu, Jiqing Qu, Hui Xiao, Xuefeng Li
TL;DR
This work tackles automatic personality recognition from audiovisual data by introducing MsMA-Net, an end-to-end two-branch network that fuses visual and auditory features through cross-attention to improve cross-modal interaction and discriminability. It combines a Multi-Scale Feature Enhancement Module (MSFEM), which captures information at multiple spatial scales and applies channel attention, with a Modal Augment Strategy (MAS) that simulates non-ideal conditions during training to improve robustness to modal loss and noise. On the ChaLearn First Impression dataset the model reaches an average Big Five personality accuracy of 0.916, which the authors report as exceeding existing audiovisual methods. Ablations confirm the individual and combined contributions of MSFEM and MAS, and robustness analyses across six non-ideal scenarios at inference show that MAS makes the model more resilient. The approach improves the practical applicability of multimodal personality recognition and sets the stage for future tri-modal extensions that add textual information and other auxiliary cues.
Abstract
Automatic personality recognition is a research hotspot at the intersection of computer science and psychology, and has a wide range of applications in human-computer interaction, personalised services and other scenarios. In this paper, an end-to-end multimodal expressive personality recognition network is established for both visual and auditory modal data. Feature-level fusion of the two modalities is carried out through a cross-attention mechanism, which effectively fuses the features of the two modal data, and a multi-scale feature enhancement module is proposed for both the visual and auditory modalities, which enhances the expression of the effective information in the features and suppresses the interference of redundant information. In addition, this paper proposes a modal enhancement training strategy that simulates non-ideal data situations such as modal loss and noise interference during the training process, which enhances the adaptability of the model to non-ideal data scenarios and improves the robustness of the model. Experimental results show that the proposed method achieves an average Big Five personality accuracy of 0.916 on the personality analysis dataset ChaLearn First Impression, which outperforms existing methods that use both audio and visual modalities. The ablation experiments also validate the respective contributions of the proposed multi-scale feature enhancement module and modal enhancement strategy to the model performance. Finally, we simulate six non-ideal data scenarios in the inference phase to verify the modal enhancement strategy's improvement in model robustness.
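
To make the idea of simulating modal loss and noise interference during training concrete, the following is a minimal sketch, not the authors' implementation. The function name `modal_augment`, the scenario names, the probability `p_augment`, the noise level `noise_std`, and the use of zero-filling for a lost modality are all assumptions made for illustration; the abstract does not specify how the corruptions are generated or which six scenarios are evaluated.

```python
import random

# Hypothetical scenario names for illustration only. The paper mentions six
# non-ideal scenarios at inference but the abstract does not enumerate them.
SCENARIOS = ("clean", "drop_visual", "drop_audio", "noisy_visual", "noisy_audio")


def modal_augment(visual, audio, rng, p_augment=0.5, noise_std=0.1):
    """Return (visual, audio, scenario) with at most one simulated degradation.

    visual, audio: lists of floats standing in for per-sample feature vectors.
    rng: a random.Random instance, passed in so runs can be reproduced.
    p_augment: assumed probability of corrupting a sample at all.
    noise_std: assumed standard deviation of additive Gaussian noise.
    """
    # Leave the sample untouched with probability 1 - p_augment.
    if rng.random() >= p_augment:
        return list(visual), list(audio), "clean"

    # Otherwise pick one non-ideal condition uniformly at random.
    scenario = rng.choice(SCENARIOS[1:])
    if scenario == "drop_visual":
        # Modal loss: replace the visual features with zeros (an assumption).
        return [0.0] * len(visual), list(audio), scenario
    if scenario == "drop_audio":
        # Modal loss: replace the audio features with zeros (an assumption).
        return list(visual), [0.0] * len(audio), scenario
    if scenario == "noisy_visual":
        # Noise interference on the visual branch only.
        noisy = [v + rng.gauss(0.0, noise_std) for v in visual]
        return noisy, list(audio), scenario
    # Remaining case: noise interference on the audio branch only.
    noisy = [a + rng.gauss(0.0, noise_std) for a in audio]
    return list(visual), noisy, scenario
```

In a training loop, a call such as `modal_augment(visual_feats, audio_feats, random.Random(0))` would be applied to each sample before the two-branch network sees it, so that the model is exposed to missing and noisy modalities as well as clean pairs. Passing the random generator explicitly keeps the corruption schedule reproducible.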
