LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics
Weiye Shi, Zhaowei Zhang, Shaoheng Yan, Yaodong Yang
TL;DR
The paper investigates whether large language models capture latent linguistic features beyond surface text by constructing a multilingual genre dataset derived from Project Gutenberg and augmenting inputs with explicit features capturing syntax, metaphor, and metre. Through fine-tuning encoder models on three binary genre tasks across six languages, the study demonstrates that metre patterns provide the most consistent performance gains, while syntactic and metaphor signals yield selective improvements depending on task and language. The findings highlight the value of incorporating deeper linguistic signals into LLM training to improve genre classification and interpretability, with implications for cross-linguistic stylistic analysis. The work contributes a new multilingual resource and a framework for probing how latent linguistic structures influence NLP tasks beyond mere word usage.
Abstract
Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns, from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural language tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works. The dataset comprises thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each sentence with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.
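The abstract does not say how the explicit features are attached to each input, so the following is only an illustrative sketch of one possible encoding, not the authors' method. It appends each selected feature to the raw sentence as a tagged text segment, which would allow a feature to be switched on or off when comparing classifier inputs. The class `LinguisticFeatures`, the function `augment_input`, the tag strings, and the example feature values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LinguisticFeatures:
    """Hypothetical container for the three feature sets named in the abstract."""
    syntax_tree: str       # bracketed parse string (assumed format)
    metaphor_count: int    # number of metaphors detected in the sentence
    phonetic_pattern: str  # e.g. a stress pattern such as "01010101" (assumed format)


def augment_input(sentence, feats, use=("syntax", "metaphor", "phonetic")):
    """Append the selected features to the sentence as tagged text segments.

    The tags ([SYNTAX], [METAPHOR], [PHONETIC]) are assumptions made for
    this sketch; the source does not specify an input format.
    """
    parts = [sentence]
    if "syntax" in use:
        parts.append(f"[SYNTAX] {feats.syntax_tree}")
    if "metaphor" in use:
        parts.append(f"[METAPHOR] {feats.metaphor_count}")
    if "phonetic" in use:
        parts.append(f"[PHONETIC] {feats.phonetic_pattern}")
    return " ".join(parts)


# Illustrative values only; they are not taken from the dataset.
example = LinguisticFeatures(
    syntax_tree="(S (NP I) (VP wandered))",
    metaphor_count=1,
    phonetic_pattern="01010101",
)
text_only = augment_input("I wandered lonely as a cloud", example, use=())
with_phonetics = augment_input("I wandered lonely as a cloud", example, use=("phonetic",))
```

Passing a different `use` tuple yields the raw-text baseline or any combination of features, which mirrors the kind of per-feature comparison the abstract describes.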
