Subjective Evaluation Profile Analysis of Science Fiction Short Stories and its Critical-Theoretical Significance

Kazuyoshi Otsuka

TL;DR

The paper reframes large language models as subjective literary critics and asks whether they exhibit consistent evaluation profiles in literary judgment. Ten original Japanese SF stories were translated into English and scored by six models across seven sessions within a single day, with PCA, clustering, and TF-IDF analyses applied to the scores and evaluative comments. The results reveal structured diversity in evaluation profiles: a hierarchical consistency pattern (from $\alpha=1.00$ to $\alpha=0.35$), inter-story variance differing by up to $4.5$-fold, and five evaluation clusters, all indicating model-specific, non-neutral evaluative tendencies. The findings argue for a probabilistic, multi-perspective form of AI literary criticism aligned with reader-response theory, offering a methodological framework for AI-assisted literary analysis and AI value-system archaeology.

Abstract

This study positions large language models (LLMs) as "subjective literary critics" to explore aesthetic preferences and evaluation patterns in literary assessment. Ten Japanese science fiction short stories were translated into English and evaluated by six state-of-the-art LLMs across seven independent sessions. Principal component analysis and clustering techniques revealed significant variations in evaluation consistency (α ranging from 1.00 to 0.35) and five distinct evaluation patterns. Additionally, evaluation variance across stories differed by up to 4.5-fold, with TF-IDF analysis confirming distinctive evaluation vocabularies for each model. Our seven-session within-day protocol using an original Science Fiction corpus strategically minimizes external biases, allowing us to observe implicit value systems shaped by RLHF and their influence on literary judgment. These findings suggest that LLMs may possess individual evaluation characteristics similar to human critical schools, rather than functioning as neutral benchmarkers.
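As a rough illustration of the kind of computation behind the score-normalization and inter-story variance figures, the following minimal sketch standardizes each session's scores to z-scores and computes a fold ratio between the most and least variable stories. The toy data, the function names, the use of population variance, and the exclusion of zero-variance stories from the ratio are all assumptions made for illustration; the abstract does not specify the paper's exact definitions.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical toy data for ONE model: scores[session][story]
# (3 sessions, 4 stories). These are not values from the paper.
scores = [
    [7.0, 8.0, 6.0, 9.0],
    [7.0, 6.0, 6.0, 9.0],
    [7.0, 7.0, 9.0, 9.0],
]


def z_scores(session):
    """Standardize one session's scores to mean 0 and standard deviation 1."""
    mu, sigma = mean(session), pstdev(session)
    if sigma == 0:
        # A session that gives every story the same score carries no ranking signal.
        return [0.0 for _ in session]
    return [(x - mu) / sigma for x in session]


def per_story_variance(sessions):
    """Population variance of each story's score across sessions."""
    return [pvariance(column) for column in zip(*sessions)]


def variance_fold(variances):
    """Ratio of the largest to the smallest non-zero variance (assumed definition)."""
    nonzero = [v for v in variances if v > 0]
    return max(nonzero) / min(nonzero)


normalized = [z_scores(session) for session in scores]
story_var = per_story_variance(scores)
fold = variance_fold(story_var)

print("per-story variance:", story_var)
print("variance fold ratio:", round(fold, 2))
```

In this toy data, two stories receive identical scores in every session, so their variance is zero and they are left out of the ratio; a real analysis would need to decide explicitly how to treat such perfectly consistent cases.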

Paper Structure

This paper contains 65 sections, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Evaluation consistency across models and sessions. See Appendix \ref{appendix_F5.1} for discussion of session-timing trade-offs.
  • Figure 2: Actual score distribution by story
  • Figure 3: Z-score distribution by story
  • Figure 4: Principal component analysis results
  • Figure 5: Clustering results
  • ...and 5 more figures