EMID: An Emotional Aligned Dataset in Audio-Visual Modality
Jialing Zou, Jiahao Mei, Guangze Ye, Tianyu Huai, Qiwei Shen, Daoguo Dong
TL;DR
EMID addresses the challenge of emotionally aligning music and images for cross-modal tasks. It introduces a dataset that encodes affective information with a 13-dimension emotional model, together with the EMI-Adapter, a module that fuses emotional cues into a shared embedding space. Semantic alignment from ImageBind is combined with emotional distances in VA space through a weighted score $T = \alpha S + (1-\alpha) E$, where $S$ is the semantic term, $E$ is the emotional term, and $\alpha$ sets the balance between them. This yields music-image pairings that are meaningful on both counts and allows the data to be expanded to 32,214 richly annotated pairs. Psychological experiments show that incorporating emotional information improves matching accuracy over semantic-only baselines, indicating closer agreement with human perception and potential for psychotherapy applications and emotion-aware AIGC. EMID thus contributes a practical resource and a lightweight integration module for affective cross-modal learning, with implications for affect-aware generation, retrieval, and therapeutic contexts.
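The weighted score can be illustrated with a minimal sketch. Only the form $T = \alpha S + (1-\alpha) E$ comes from the text above. Everything else here is an assumption made for illustration: the function names are hypothetical, $S$ is taken to be the cosine similarity of two embeddings (for example, ones produced by ImageBind), VA coordinates are assumed to lie in $[-1, 1]$, and $E$ is assumed to be a linear rescaling of the Euclidean VA distance into $[0, 1]$. The paper may define these terms differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed form of S)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def emotional_similarity(va_music, va_image):
    """Map the Euclidean distance between two (valence, arousal) points to [0, 1].

    Assumption: both coordinates lie in [-1, 1], so the largest possible
    distance is the diagonal of that square, sqrt(8).
    """
    max_dist = math.sqrt(8.0)
    return 1.0 - math.dist(va_music, va_image) / max_dist


def combined_score(s, e, alpha=0.5):
    """Weighted score T = alpha * S + (1 - alpha) * E."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * s + (1.0 - alpha) * e


def score_pair(music_emb, image_emb, va_music, va_image, alpha=0.5):
    """Score one music-image pair from its embeddings and VA coordinates."""
    s = cosine_similarity(music_emb, image_emb)
    e = emotional_similarity(va_music, va_image)
    return combined_score(s, e, alpha)
```

Under these assumptions, `alpha = 1` reduces the score to the semantic-only baseline and `alpha = 0` ranks pairs by emotional proximity alone; intermediate values trade one against the other.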
Abstract
In this paper, we propose the Emotionally paired Music and Image Dataset (EMID), a novel dataset designed for the emotional matching of music and images, to facilitate auditory-visual cross-modal tasks such as generation and retrieval. Unlike existing approaches, which focus primarily on semantic correlations or on coarsely divided emotional relations, EMID emphasizes emotional consistency between music and images and describes it with an advanced 13-dimension emotional model. By incorporating emotional alignment into the dataset, we aim to establish pairs that closely match human perceptual understanding, thereby improving performance on auditory-visual cross-modal tasks. We also design a supplemental module, EMI-Adapter, to optimize existing cross-modal alignment methods. To validate the effectiveness of EMID, we conduct a psychological experiment, which demonstrates that considering the emotional relationship between the two modalities improves matching accuracy from an abstract perspective. This research lays the foundation for future cross-modal research in domains such as psychotherapy and advances the understanding and use of emotions in cross-modal alignment. The EMID dataset is available at https://github.com/ecnu-aigc/EMID.
