LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics

Weiye Shi, Zhaowei Zhang, Shaoheng Yan, Yaodong Yang

TL;DR

The paper investigates whether large language models capture latent linguistic features beyond surface text by constructing a multilingual genre dataset derived from Project Gutenberg and augmenting inputs with explicit features capturing syntax, metaphor, and metre. Through fine-tuning encoder models on three binary genre tasks across six languages, the study demonstrates that metre patterns provide the most consistent performance gains, while syntactic and metaphor signals yield selective improvements depending on task and language. The findings highlight the value of incorporating deeper linguistic signals into LLM training to improve genre classification and interpretability, with implications for cross-linguistic stylistic analysis. The work contributes a new multilingual resource and a framework for probing how latent linguistic structures influence NLP tasks beyond mere word usage.

Abstract

Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns, from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural language tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public-domain literary works. The dataset comprises thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each sentence with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.

Paper Structure

This paper contains 27 sections, 4 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of our method. We extract various types of linguistic information from raw text and integrate them with the original sentences. The model is then trained to embed these enriched inputs uniformly into the latent space, thereby enhancing performance on genre classification tasks.
  • Figure 2: Detailed process of our method. We extract syntactic tree information and metrical patterns using two well-established natural language processing tools: spaCy-Benepar and PoetryTools. Metaphor counts are obtained using Metaphor RoBERTa, a state-of-the-art pretrained model designed to detect word-level metaphor usage. During training, we integrate the original sentence with the extracted linguistic features to enhance the model’s performance.
  • Figure 3: Syntactic tree analysis. The x-axis represents $\log(\text{depth\_ratio})$, while the y-axis represents $\log(\text{tree\_depth} + 1)$. Green dots indicate novels, blue dots poetry, and red dots drama. Subfigure (a) shows the plot for Poetry vs. Novel in English, which is clearly linearly separable. Subfigure (b) presents the Poetry vs. Novel contrast in French, revealing a more complex distribution. Subfigure (c) displays the Novel vs. Drama set in English, which also exhibits significant overlap and is difficult to separate.
  • Figure 4: Metre pattern analysis. We extract metre patterns from the raw texts and represent them as binary feature vectors, where each bit corresponds to a rhythmic unit. These vectors are then padded to uniform length and projected into a two-dimensional latent space using Principal Component Analysis (PCA).
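
The projection described for Figure 4 can be illustrated with a minimal sketch. It is not the authors' implementation: the toy patterns, the convention that 1 marks a stressed unit and 0 an unstressed one, the choice of 0 as the padding value, and the helper names `pad_patterns` and `pca_2d` are all assumptions made for illustration. PCA is computed here with a plain SVD in NumPy rather than with any particular library's PCA class.

```python
import numpy as np


def pad_patterns(patterns, pad_value=0):
    """Right-pad variable-length binary metre patterns to a common length."""
    width = max(len(p) for p in patterns)
    padded = np.full((len(patterns), width), pad_value, dtype=float)
    for i, p in enumerate(patterns):
        padded[i, : len(p)] = p
    return padded


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Hypothetical metre patterns (assumed: 1 = stressed, 0 = unstressed).
patterns = [
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1],
]

X = pad_patterns(patterns)   # one row per sentence, padded to the longest pattern
coords = pca_2d(X)           # one 2-D point per sentence, ready for a scatter plot
```

Each row of `coords` is one sentence in the two-dimensional space; colouring the points by genre would give a plot of the kind the caption describes. Padding with 0 makes padded positions indistinguishable from unstressed units, so a real pipeline might prefer a distinct padding value or an explicit length feature.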