Decoding Speaker-Normalized Pitch from EEG for Mandarin Perception
Jiaxin Chen, Yiming Wang, Ziyu Zhang, Jiayang Han, Yin-Long Liu, Rui Feng, Xiuyuan Liang, Zhen-Hua Ling, Jiahong Yuan
TL;DR
This work asks whether the brain encodes Mandarin pitch independently of individual speakers’ pitch ranges, and addresses the question by decoding $F_0$ contours from EEG. It introduces CE-ViViT, a multi-branch architecture that combines a Convolutional Embedding module (built on ScConv and DANE) with a ViViT-based two-stage Feature Encoder to map EEG to pitch contours, trained with a mean-squared-error objective. Experiments on monosyllabic Mandarin stimuli show that speaker-normalized pitch contours are decoded more accurately than raw contours in multi-speaker data, whereas single-speaker data show no such advantage. This supports the idea that neural pitch perception emphasizes relative pitch at the phoneme level. The approach performs comparably to state-of-the-art EEG-based regression methods, and the ablation study highlights the importance of temporal information and of the embedding components for pitch reconstruction.
Abstract
The same speech content produced by different speakers exhibits significant differences in pitch contour, yet listeners' semantic perception remains unaffected. This phenomenon may arise because the brain perceives pitch contours independently of individual speakers' pitch ranges. In this work, we recorded electroencephalography (EEG) signals while participants listened to Mandarin monosyllables that varied in tone, phoneme, and speaker. We propose the CE-ViViT model to decode raw or speaker-normalized pitch contours directly from EEG. Experimental results demonstrate that the proposed model decodes pitch contours with modest errors, achieving performance comparable to state-of-the-art EEG regression methods. Moreover, speaker-normalized pitch contours were decoded more accurately than raw contours, supporting the hypothesis that the brain encodes relative pitch.
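
To make the distinction between raw and speaker-normalized pitch contours concrete, the following is a minimal sketch of one common normalization: z-scoring log-$F_0$ within each speaker. The abstract does not state which normalization the authors use, so this particular formula is an assumption, as are the function names `speaker_normalize` and `mse`, the speaker labels, and the toy $F_0$ values, all of which are hypothetical. The sketch also includes a plain mean-squared-error function of the kind used as a regression objective.

```python
import math
from statistics import mean, pstdev


def speaker_normalize(contours_by_speaker):
    """Z-score log-F0 within each speaker (assumed normalization, not the paper's).

    contours_by_speaker maps a speaker id to a list of F0 contours, where each
    contour is a list of F0 values in Hz for voiced frames. A speaker whose
    pooled F0 values are all identical would give a zero standard deviation;
    that case is not handled in this sketch.
    """
    normalized = {}
    for speaker, contours in contours_by_speaker.items():
        log_f0 = [[math.log(f) for f in contour] for contour in contours]
        # Pool every frame from this speaker to estimate the speaker's pitch range.
        pooled = [value for contour in log_f0 for value in contour]
        mu, sigma = mean(pooled), pstdev(pooled)
        normalized[speaker] = [
            [(value - mu) / sigma for value in contour] for contour in log_f0
        ]
    return normalized


def mse(predicted, target):
    """Mean squared error between two equal-length contours."""
    return mean((p - t) ** 2 for p, t in zip(predicted, target))


# Hypothetical toy data: the same rising contour from two speakers,
# one of them an octave higher than the other.
raw = {
    "low_voice": [[100.0, 110.0, 121.0]],
    "high_voice": [[200.0, 220.0, 242.0]],
}
norm = speaker_normalize(raw)

raw_gap = mse(raw["low_voice"][0], raw["high_voice"][0])
norm_gap = mse(norm["low_voice"][0], norm["high_voice"][0])
print(raw_gap, norm_gap)
```

In this toy case the two raw contours are far apart in Hz, while their normalized versions coincide up to floating-point error, because normalization removes each speaker's pitch level and range and keeps only the relative contour shape.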
