EMID: An Emotional Aligned Dataset in Audio-Visual Modality

Jialing Zou; Jiahao Mei; Guangze Ye; Tianyu Huai; Qiwei Shen; Daoguo Dong

TL;DR

EMID addresses the challenge of emotionally aligning music and images for cross-modal tasks. It introduces a dataset that encodes affective information with a 13-dimension emotional model, together with the EMI-Adapter, a module that fuses emotional cues into a shared embedding space. Semantic alignment from ImageBind is combined with emotional distances in valence-arousal (VA) space through a weighted score $T = \alpha S + (1-\alpha) E$, where $S$ is the semantic score and $E$ the emotional score. This yields music-image pairings that are consistent both semantically and emotionally, and supports expanding the dataset to 32,214 pairs with rich annotations. Psychological experiments show that incorporating emotional information improves matching accuracy over semantic-only baselines, indicating closer agreement with human perception and potential for psychotherapy applications and emotion-aware AIGC. EMID thus contributes a practical resource and a lightweight integration module for affective cross-modal learning, with implications for affect-aware generation and retrieval.
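As a worked illustration of the weighted score, take $\alpha = 0.6$ (the value quoted in the figure captions) and assume, purely for illustration, a semantic score $S = 0.8$ and an emotional score $E = 0.5$ for one music-image pair. These two numbers are hypothetical and are not taken from the paper.

```latex
\begin{align*}
T &= \alpha S + (1-\alpha) E \\
  &= 0.6 \times 0.8 + 0.4 \times 0.5 \\
  &= 0.48 + 0.20 = 0.68
\end{align*}
```

Setting $\alpha = 1$ reduces the score to $T = S$, which is the semantic-only case that the captions describe as "without emotional alignment".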

Abstract

In this paper, we propose the Emotionally paired Music and Image Dataset (EMID), a novel dataset designed for the emotional matching of music and images, to facilitate auditory-visual cross-modal tasks such as generation and retrieval. Unlike existing approaches that primarily focus on semantic correlations or roughly divided emotional relations, EMID emphasizes the significance of emotional consistency between music and images using an advanced 13-dimension emotional model. By incorporating emotional alignment into the dataset, it aims to establish pairs that closely align with human perceptual understanding, thereby improving the performance of auditory-visual cross-modal tasks. We also design a supplemental module named EMI-Adapter to optimize existing cross-modal alignment methods. To validate the effectiveness of EMID, we conduct a psychological experiment, which demonstrates that considering the emotional relationship between the two modalities effectively improves matching accuracy from an abstract perspective. This research lays the foundation for future cross-modal research in domains such as psychotherapy and contributes to advancing the understanding and utilization of emotions in cross-modal alignment. The EMID dataset is available at https://github.com/ecnu-aigc/EMID.

Paper Structure (20 sections, 2 equations, 6 figures, 3 tables)

Introduction
Related Work
Cross-modal Dataset
Cross-modal Task
The Proposed Dataset: EMID
Emotional Alignment Process
Acquisition of Emotional Features
Formation of Candidate Pairs
Injection of Emotional Information
Filtering and Expansion of Materials
Data Filtering
Data Expansion
Dataset Statistics
Assessment Experiment
Experimental Protocol
...and 5 more sections

Figures (6)

Figure 1: A defect of current cross-modal alignment approaches: they ignore the emotional relations between music and images.
Figure 2: The construction process of EMID. (1) Use ImageBind to obtain semantic embeddings of the images and music as matrices $X$ and $Y$, then compute the similarity score matrix $S = \mathrm{Softmax}(X \cdot Y^{T})$. (2) Select the $K$ images with the highest scores (in red) from the $i$-th row of $S$ as the candidate image set for the $i$-th music clip, then calculate the cosine distance between each music clip and its candidate images in the emotional valence-arousal space. (3) Generate EMID through a weighted sum of semantic and emotional scores, retaining the $m$ best-matching pairs. (4) Design a lightweight component, EMI-Adapter, which, once trained on EMID, merges emotional features into the joint embedding space.
Figure 3: Valence-Arousal coordinates of eight emotional categories from the NRC-VAD-Lexicon dictionary (Mohammad, 2018) and the final calculated values.
Figure 4: Training loss and evaluation results over 50 iterations for three types of networks, where $@10$ indicates that the top-10 images are recommended. The EMID version used for training was built with $m = 3$ and $\alpha = 0.6$.
Figure 5: Distribution of music-image pairs in the EMID without ($\alpha = 1$) and with ($\alpha = 0.6$) emotional alignment.
...and 1 more figures
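
Steps (1) to (3) of the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `cosine_sim`, `build_pairs`) are hypothetical, rows of $S$ are assumed to index music clips, the emotional score $E$ is assumed to be the cosine similarity in valence-arousal space (the caption mentions a cosine distance but does not say how it is turned into a score), and any precomputed arrays stand in for ImageBind embeddings.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cosine_sim(a, b):
    # Cosine similarity between one vector a (d,) and each row of b (k, d).
    denom = np.linalg.norm(b, axis=1) * np.linalg.norm(a) + 1e-12
    return (b @ a) / denom

def build_pairs(music_emb, image_emb, music_va, image_va, k=5, m=3, alpha=0.6):
    """Return, for each music clip, its m best (music_idx, image_idx) pairs.

    music_emb / image_emb: semantic embeddings (assumed precomputed).
    music_va / image_va:   valence-arousal coordinates, shape (n, 2).
    """
    # Step 1: semantic similarity matrix, one row per music clip (assumption).
    S = softmax(music_emb @ image_emb.T, axis=1)
    pairs = []
    for i in range(S.shape[0]):
        # Step 2: top-k images by semantic score form the candidate set.
        cand = np.argsort(S[i])[::-1][:k]
        # Assumed emotional score: cosine similarity in VA space.
        E = cosine_sim(music_va[i], image_va[cand])
        # Step 3: weighted score T = alpha * S + (1 - alpha) * E; keep the m best.
        T = alpha * S[i, cand] + (1 - alpha) * E
        best = cand[np.argsort(T)[::-1][:m]]
        pairs.append([(i, int(j)) for j in best])
    return pairs
```

With `alpha=1.0` the emotional term vanishes and the function simply returns the top `m` images by semantic score, which corresponds to the "without emotional alignment" setting in the Figure 5 caption. Because softmax scores and cosine similarities live on different scales, a real pipeline would likely normalize the two terms before mixing them; that step is left out here since the source does not describe it.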
