Contextual Gating within the Transformer Stack: Synergistic Feature Modulation for Enhanced Lyrical Classification and Calibration
M. A. Gameiro
TL;DR
This work introduces the ISFL (Intermediate SFL) module, a Contextual Gating mechanism inserted inside the Transformer encoder stack to fuse four structural, lyrics-related features with deep semantic representations. By gating the hidden states after the sixth layer of a BERT-based encoder, the model achieves state-of-the-art Accuracy (0.9910) and Macro F1 (0.9910) while maintaining strong calibration (ECE ≈ 0.0081) and a low Log Loss (0.0489). The results indicate that mid-stack fusion of structural and semantic information yields superior discriminative power with reliable probability estimates, outperforming the prior SFL and RF baselines. The findings support the broader potential of internal, mid-stack feature modulation for multimodal and context-rich NLP tasks; future directions include dynamic injection points and cross-domain applications.
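For readers unfamiliar with the calibration metric cited above, the following is a minimal sketch of how Expected Calibration Error is commonly computed: predictions are grouped into equal-width confidence bins, and the gap between mean confidence and observed accuracy in each bin is averaged, weighted by bin size. The function name, the choice of 10 bins, and the toy inputs are assumptions made for illustration; the binning scheme used to obtain the reported value is not stated here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (illustrative sketch; n_bins=10 is an assumption).

    confidences: predicted probability of the chosen class, each in [0, 1]
    correct:     1 if the prediction was right, 0 otherwise
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy data (hypothetical): four predictions at 0.95 confidence, three correct.
toy_ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0])
```

In the toy case all four predictions fall into one bin with mean confidence 0.95 and accuracy 0.75, so the gap for that bin is the whole score. A value near zero, such as the one reported above, means predicted confidences track observed accuracy closely.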
Abstract
This study introduces an architectural advancement in feature fusion for lyrical content classification by integrating auxiliary structural features directly into the self-attention mechanism of a pre-trained Transformer. I propose the SFL Transformer, a novel deep learning model that uses a Contextual Gating mechanism (an Intermediate SFL) to modulate the sequence of hidden states within the BERT encoder stack, rather than fusing features at the final output layer. In this approach, the deep, contextualized semantic features (H_seq) are modulated by low-dimensional structural cues (F_struct). The model is applied to a challenging binary classification task derived from UMAP-reduced lyrical embeddings. The SFL Transformer achieved an Accuracy of 0.9910 and a Macro F1 score of 0.9910, improving on the previous state of the art set by the previously published SFL model (Accuracy 0.9894). Crucially, this Contextual Gating strategy maintained exceptional reliability, with a low Expected Calibration Error (ECE = 0.0081) and Log Loss (0.0489). This work validates the hypothesis that injecting auxiliary context mid-stack is the most effective means of synergistically combining structural and semantic information, creating a model with both superior discriminative power and high-fidelity probability estimates.
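The gating idea described above can be illustrated with a minimal NumPy sketch. It assumes one common form of contextual gating: a sigmoid gate computed from the structural features and applied elementwise to every token's hidden state. The exact gating equation is not given in this section, so the function `contextual_gate`, the parameters `w_gate` and `b_gate`, and the toy dimensions are all hypothetical. The placement after a specific encoder layer is not modelled here; `h_seq` simply stands in for the hidden states taken from that layer.

```python
import numpy as np


def contextual_gate(h_seq, f_struct, w_gate, b_gate):
    """Modulate hidden states with a gate derived from structural features.

    h_seq:    (batch, seq_len, hidden) hidden states from a mid-stack layer
    f_struct: (batch, n_struct) low-dimensional structural features
    w_gate:   (n_struct, hidden) hypothetical gate projection weights
    b_gate:   (hidden,) hypothetical gate bias
    """
    # Project the structural features to the hidden size and squash to (0, 1).
    gate = 1.0 / (1.0 + np.exp(-(f_struct @ w_gate + b_gate)))  # (batch, hidden)
    # Broadcast the same gate over every token position in the sequence.
    return h_seq * gate[:, None, :]


# Toy shapes only: 4 structural features as in the text; hidden size is arbitrary.
rng = np.random.default_rng(0)
batch, seq_len, hidden, n_struct = 2, 5, 8, 4
h_seq = rng.standard_normal((batch, seq_len, hidden))
f_struct = rng.standard_normal((batch, n_struct))
w_gate = rng.standard_normal((n_struct, hidden)) * 0.02
b_gate = np.zeros(hidden)

h_gated = contextual_gate(h_seq, f_struct, w_gate, b_gate)
```

In a full model the gated states would be passed on to the remaining encoder layers, so later self-attention operates on representations already conditioned on the structural cues. That is the contrast with fusing features only at the final output layer.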
