Synergistic Feature Fusion for Latent Lyrical Classification: A Gated Deep Learning Architecture
M. A. Gameiro
TL;DR
This work tackles the fusion of high-dimensional Sentence-BERT (SBERT) semantics with low-dimensional structural cues for lyric classification. It introduces the Synergistic Fusion Layer (SFL), a gated fusion mechanism defined by $F_{SFL} = F_{deep} \odot G$, where $G = \sigma(W_G F_{struct} + b)$, $\sigma$ is the logistic sigmoid, and $\odot$ denotes element-wise multiplication. Relative to a Random Forest (RF) baseline that concatenates the two feature sets, SFL delivers higher accuracy and markedly better calibration: $ACC = 0.9894$ and $ECE = 0.00351$, a 93% reduction in Expected Calibration Error and a 2.5× lower Log Loss. The results demonstrate the value of non-linear, context-aware fusion for multimodal lyric analysis and underline the practical importance of reliable probability estimates in deployment. A natural extension is to integrate gating mechanisms more tightly within transformer-based architectures, so that structural metadata can modulate semantic representations in a context-sensitive way.
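The gating formula can be sketched in plain Python as follows. This is a minimal, hypothetical illustration, not the author's implementation: it assumes $F_{deep}$ and $F_{struct}$ arrive as flat lists, that $W_G$ has one row per deep feature (so the gate $G$ has the same dimension as $F_{deep}$, as the element-wise product requires), and that $\sigma$ is the logistic sigmoid. The function and argument names (`synergistic_fusion`, `w_g`, `b`) are illustrative; training of $W_G$ and $b$ and the downstream classifier are omitted.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid, sigma(x) = 1 / (1 + e^-x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow in exp(-x) for large negative x
    return z / (1.0 + z)


def synergistic_fusion(
    f_deep: Sequence[float],
    f_struct: Sequence[float],
    w_g: Sequence[Sequence[float]],
    b: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Hypothetical sketch of F_SFL = F_deep (element-wise *) sigma(W_G F_struct + b).

    f_deep:   deep semantic vector of length d (e.g. an SBERT embedding)
    f_struct: structural feature vector of length k
    w_g:      gate weights, d rows of length k (maps structural cues to d gates)
    b:        gate bias of length d
    Returns (fused vector, gate vector), both of length d.
    """
    d, k = len(f_deep), len(f_struct)
    if len(w_g) != d or len(b) != d:
        raise ValueError("W_G and b need one row/entry per deep feature")
    gate = []
    for row, bias in zip(w_g, b):
        if len(row) != k:
            raise ValueError("each row of W_G must match len(f_struct)")
        pre_activation = sum(w * s for w, s in zip(row, f_struct)) + bias
        gate.append(sigmoid(pre_activation))
    # Element-wise (Hadamard) product: each semantic dimension is scaled by its gate.
    fused = [x * g for x, g in zip(f_deep, gate)]
    return fused, gate


if __name__ == "__main__":
    rng = random.Random(0)
    d, k = 8, 3  # toy sizes; real SBERT embeddings are much wider
    f_deep = [rng.gauss(0.0, 1.0) for _ in range(d)]
    f_struct = [rng.random() for _ in range(k)]
    w_g = [[rng.gauss(0.0, 0.5) for _ in range(k)] for _ in range(d)]
    b = [0.0] * d
    fused, gate = synergistic_fusion(f_deep, f_struct, w_g, b)
    print("gate: ", [round(g, 3) for g in gate])
    print("fused:", [round(v, 3) for v in fused])
```

Because every gate value lies in $(0, 1)$, the structural cues can only attenuate each semantic dimension or leave it nearly unchanged; they never flip its sign. This is one concrete reading of "modulating" the embeddings, as opposed to concatenation, where structural features simply sit alongside the semantic ones.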
Abstract
This study addresses the challenge of integrating complex, high-dimensional deep semantic features with simple, interpretable structural cues for lyrical content classification. We introduce the Synergistic Fusion Layer (SFL), a deep learning architecture that uses a gating mechanism to modulate Sentence-BERT embeddings ($F_{deep}$) with low-dimensional auxiliary features ($F_{struct}$). The task, derived by clustering UMAP-reduced lyrical embeddings, is reframed as binary classification: distinguishing a dominant, homogeneous cluster (Class 0) from all other content (Class 1). The SFL model achieved an accuracy of 0.9894 and a Macro F1 score of 0.9894, outperforming a comprehensive Random Forest (RF) baseline that used feature concatenation (accuracy = 0.9868). More importantly, the SFL model was far better calibrated: its Expected Calibration Error (ECE = 0.0035) is 93% lower than that of the RF baseline (ECE = 0.0500), and its Log Loss (0.0304) is about 2.5× lower (RF: 0.0772). These results support the architectural hypothesis that non-linear gating outperforms simple feature concatenation, establishing the SFL model as a robust and trustworthy system for complex multimodal lyrical analysis.
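The two calibration metrics reported in the abstract can be computed as follows. This is a minimal sketch, not the evaluation code used in the study: it implements one common binary ECE variant (confidence of the predicted class, equal-width bins, each bin weighted by its share of examples) and clips probabilities before taking logarithms. The paper does not state its bin count or binning convention, so `n_bins=10`, the bin boundaries, and the function names are assumptions; reported ECE values can shift with these choices.

```python
import math
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE: sum over bins of (bin share) * |bin accuracy - bin confidence|.

    probs:  predicted probability of Class 1 for each example
    labels: true labels in {0, 1}
    Assumes equal-width bins over the confidence of the predicted class.
    """
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and of equal length")
    conf_sum = [0.0] * n_bins
    correct = [0] * n_bins
    count = [0] * n_bins
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        conf_sum[idx] += conf
        correct[idx] += int(pred == y)
        count[idx] += 1
    n = len(probs)
    ece = 0.0
    for c, a, m in zip(conf_sum, correct, count):
        if m:
            ece += (m / n) * abs(a / m - c / m)
    return ece


def binary_log_loss(
    probs: Sequence[float], labels: Sequence[int], eps: float = 1e-15
) -> float:
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


if __name__ == "__main__":
    # Toy, made-up predictions: an overconfident model (0.9 confidence, 50% accurate).
    probs = [0.9] * 10
    labels = [1] * 5 + [0] * 5
    print("ECE:", round(expected_calibration_error(probs, labels), 4))
    print("Log Loss:", round(binary_log_loss(probs, labels), 4))
```

Applied to the abstract's figures, the relative ECE reduction is $1 - 0.0035 / 0.0500 = 0.93$, and the Log Loss ratio is $0.0772 / 0.0304 \approx 2.54$, which matches the reported 93% and 2.5×.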
