HEAR: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation

Shuyang Liu, Yuan Jin, Rui Lin, Shizhe Chen, Junyu Dai, Tao Jiang

TL;DR

Automated music aesthetic evaluation is difficult because musical perception is multidimensional and labeled data are scarce. HEAR addresses this with three components: a multi-source multi-scale representations module, a hierarchical augmentation strategy, and a hybrid objective that combines regression and ranking losses to produce robust aesthetic scores and reliable top-tier song identification. On the ICASSP 2026 SongEval benchmark, HEAR consistently improves over the baseline on LCC, SRCC, KTAU, and TTA. It is trained with the joint objective $L_{\text{total}} = L_{\text{SmoothL1}} + \beta L_{\text{ListMLE}}$, where $\beta$ is set separately for each track. The design targets data scarcity and the multidimensional character of musical perception, and the authors release public code and pretrained weights for practical deployment.

Abstract

Evaluating song aesthetics is challenging due to the multidimensional nature of musical perception and the scarcity of labeled data. We propose HEAR, a robust music aesthetic evaluation framework that combines: (1) a multi-source multi-scale representations module to obtain complementary segment- and track-level features, (2) a hierarchical augmentation strategy to mitigate overfitting, and (3) a hybrid training objective that integrates regression and ranking losses for accurate scoring and reliable top-tier song identification. Experiments demonstrate that HEAR consistently outperforms the baseline across all metrics on both tracks of the ICASSP 2026 SongEval benchmark. The code and trained model weights are available at https://github.com/Eps-Acoustic-Revolution-Lab/EAR_HEAR.
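
The hybrid objective stated in the TL;DR, $L_{\text{total}} = L_{\text{SmoothL1}} + \beta L_{\text{ListMLE}}$, can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the function names, the `delta` threshold, the per-batch averaging, and the default `beta=0.1` are assumptions chosen for illustration, and the source does not give the track-specific $\beta$ values. The ranking term follows the standard ListMLE definition (negative Plackett-Luce log-likelihood of the ground-truth ordering).

```python
import math

def smooth_l1(pred, target, delta=1.0):
    """Mean Smooth L1 (Huber-style) regression loss; delta is an assumed threshold."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * d * d / delta if d < delta else d - 0.5 * delta
    return total / len(pred)

def listmle(pred, target):
    """ListMLE ranking loss: negative log-likelihood of the ground-truth
    ordering under a Plackett-Luce model built from the predicted scores."""
    # Indices sorted by ground-truth score, best item first.
    order = sorted(range(len(target)), key=lambda i: target[i], reverse=True)
    s = [pred[i] for i in order]
    loss = 0.0
    for i in range(len(s)):
        tail = s[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in tail))
        loss += log_z - s[i]
    return loss

def hybrid_loss(pred, target, beta=0.1):
    """L_total = L_SmoothL1 + beta * L_ListMLE (beta value here is hypothetical)."""
    return smooth_l1(pred, target) + beta * listmle(pred, target)

if __name__ == "__main__":
    labels = [4.5, 3.0, 1.5]       # made-up aesthetic scores for three songs
    good = [4.0, 3.2, 2.0]         # predictions that preserve the ranking
    bad = [2.0, 3.2, 4.0]          # predictions that invert the ranking
    print(hybrid_loss(good, labels), hybrid_loss(bad, labels))
```

In this sketch the regression term penalizes the distance between predicted and labeled scores, while the ranking term penalizes predictions that order the songs differently from the labels, which is the property that matters for identifying top-tier songs.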

Paper Structure

This paper contains 12 sections, 3 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Overview of our proposed HEAR.