A Comparative Study of LLM Prompting and Fine-Tuning for Cross-genre Authorship Attribution on Chinese Lyrics
Yuxin Li, Lorraine Xu, Meng Fan Wang
TL;DR
This study examines cross-genre authorship attribution for Chinese lyrics. To address data scarcity, it builds a balanced 1,000-song dataset and a smaller test set (Test2), and fine-tunes a domain-specific Chinese-RoBERTa model. It compares this fine-tuned model against zero-shot attribution with the DeepSeek LLM using linguistically informed prompts, following PAN evaluation guidelines. The results show strong genre dependence: Folklore & Tradition is highly discriminative, while other genres show variable performance. Fine-tuning improves robustness on real-world data (Test1) but yields limited gains on the synthetically augmented Test2. The work provides a first benchmark, exposes evaluation pitfalls, and suggests directions for improving attribution performance: larger and more diverse datasets, less reliance on token-level augmentation, balanced author representation, and domain-adaptive pretraining.
Abstract
We present a study of authorship attribution for Chinese lyrics, a domain that lacks clean, public datasets. Our contributions are twofold: (1) we create a new, balanced dataset of Chinese lyrics spanning multiple genres, and (2) we develop and fine-tune a domain-specific model and compare its performance against zero-shot inference with the DeepSeek LLM. We test two central hypotheses. First, we hypothesize that a fine-tuned model will outperform a zero-shot LLM baseline. Second, we hypothesize that performance is genre-dependent. Our experiments strongly confirm Hypothesis 2: structured genres (e.g., Folklore & Tradition) yield significantly higher attribution accuracy than more abstract genres (e.g., Love & Romance). Hypothesis 1 receives only partial support: fine-tuning improves robustness and generalization on Test1 (real-world data and difficult genres), but offers limited or ambiguous gains on Test2, a smaller, synthetically augmented set. We show that the design limitations of Test2 (e.g., label imbalance, shallow lexical differences, and narrow genre sampling) can obscure the true effectiveness of fine-tuning. Our work establishes the first benchmark for cross-genre Chinese lyric attribution, highlights the importance of genre-sensitive evaluation, and provides a public dataset and analytical framework for future research. We conclude with four recommendations: enlarge and diversify test sets, reduce reliance on token-level data augmentation, balance author representation across genres, and investigate domain-adaptive pretraining as a pathway to improved attribution performance.
