Subjective Evaluation Profile Analysis of Science Fiction Short Stories and its Critical-Theoretical Significance
Kazuyoshi Otsuka
TL;DR
The paper reframes large language models as subjective literary critics to examine whether they exhibit consistent evaluation profiles in literary judgment. Ten original Japanese SF stories were translated into English and evaluated by six models across seven sessions in a single day, with PCA, clustering, and TF-IDF analyses applied to the scores and evaluative comments. The results reveal structured diversity in evaluation profiles, including a hierarchical consistency pattern (from $\alpha=1.00$ to $\alpha=0.35$), inter-story variance differing by up to $4.5$-fold, and five evaluation clusters, all indicating model-specific, non-neutral evaluative tendencies. The findings argue for a probabilistic, multi-perspective form of AI literary criticism aligned with reader-response theory, offering a methodological framework for AI-assisted literary analysis and AI value-system archaeology.
Abstract
This study positions large language models (LLMs) as "subjective literary critics" to explore aesthetic preferences and evaluation patterns in literary assessment. Ten Japanese science fiction short stories were translated into English and evaluated by six state-of-the-art LLMs across seven independent sessions. Principal component analysis and clustering techniques revealed significant variations in evaluation consistency (α ranging from 1.00 to 0.35) and five distinct evaluation patterns. Additionally, evaluation variance across stories differed by up to 4.5-fold, and TF-IDF analysis confirmed a distinctive evaluation vocabulary for each model. Our seven-session within-day protocol, applied to an original science fiction corpus, is designed to minimize external biases, allowing us to observe implicit value systems shaped by RLHF and their influence on literary judgment. These findings suggest that LLMs may possess individual evaluation characteristics similar to human critical schools, rather than functioning as neutral benchmarkers.
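As a rough illustration of the two consistency figures quoted above, the sketch below computes a session-to-session consistency coefficient and a fold ratio of per-story variances for a single model. It rests on assumptions not stated in the abstract: α is taken to be Cronbach's alpha with sessions treated as items and stories as cases, and the "fold" figure is read as the ratio of the largest to the smallest per-story variance across sessions. The function names `cronbach_alpha` and `variance_fold` and the toy scores are hypothetical and are not taken from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for one model (assumed reading of the paper's alpha).

    scores: list of rows, one row per story, one column per session.
    """
    n_stories = len(scores)
    k = len(scores[0])  # number of sessions
    if n_stories < 2 or k < 2:
        raise ValueError("need at least two stories and two sessions")
    # Sample variance of each session's scores across stories.
    session_vars = [variance(col) for col in zip(*scores)]
    # Sample variance of each story's total score summed over sessions.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores do not vary across stories")
    return k / (k - 1) * (1 - sum(session_vars) / total_var)


def variance_fold(scores):
    """Ratio of the largest to the smallest per-story variance across sessions.

    Stories with zero variance are skipped to avoid division by zero;
    this is a simplification made for the sketch.
    """
    story_vars = [variance(row) for row in scores]
    positive = [v for v in story_vars if v > 0]
    if not positive:
        raise ValueError("no story shows any variance across sessions")
    return max(positive) / min(positive)


# Toy data (illustrative only): 3 stories x 3 sessions.
identical_sessions = [[7, 7, 7], [5, 5, 5], [9, 9, 9]]
noisy_sessions = [[1, 2, 3], [2, 4, 6], [5, 5, 6]]

alpha_identical = cronbach_alpha(identical_sessions)
fold_noisy = variance_fold(noisy_sessions)
```

Under this reading, a model that returns identical scores in every session reaches α = 1, which matches the upper end of the reported range; lower values indicate that repeated sessions rank the stories less consistently.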
