EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation

Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Jialin Liu, Chenzhuo Zhao, Zhibo Yang, Bin-Bin Yang, Feng Xiao

Abstract

Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation. Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing key signals to guide story generation. However, existing methods face a dilemma: prompt engineering for closed-source models suffers from poor adaptability, while fine-tuning approaches for open-source models lack the rigorous reasoning capabilities essential for story evaluation. To address this, we propose the Self-Evolving Pairwise Reasoning (EvolvR) framework. Grounded in pairwise comparison, the framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. To ensure data quality, these raw CoTs undergo a self-filtering process that uses multiple agents to check their logical rigor and robustness. Finally, the evaluator trained on the refined data is deployed as a reward model to guide the story generation task. Experimental results demonstrate that our framework achieves state-of-the-art (SOTA) performance on three evaluation benchmarks: StoryER, HANNA, and OpenMEVA. Furthermore, when serving as a reward model, it significantly enhances the quality of generated stories, thereby validating the effectiveness of our self-evolving approach.

Paper Structure

This paper contains 53 sections, 12 equations, 16 figures, 15 tables, 1 algorithm.

Figures (16)

  • Figure 1: A reviewer hesitates over scoring a single story, with feedback that feels cryptic to the writer. When assessing two stories, scores are precise, yet the feedback, though acceptable, leaves the writer unsure how to revise. The writer then turns to readers. Some reach the same conclusions but offer different suggestions, which prove helpful, and the writer crafts better stories.
  • Figure 2: The EvolvR Framework. We self-synthesize a diverse set of CoT rationales via a multi-persona strategy and refine them through a multi-agent evolution pipeline to ensure high quality. The trained evaluator is then deployed as a reward model to guide and enhance story generation.
  • Figure 3: Training loss curves for the Pointwise and Pairwise models trained via Supervised Fine-Tuning. Both models were trained using a standard cross-entropy loss objective.
  • Figure 4: Evolution of the average reward during the GRPO training process. The reward is designed to be 1 for perfect predictions and to decay exponentially with error. The reward increases steadily and eventually plateaus.
  • Figure 5: Confusion matrices illustrating the score agreement between model predictions and ground-truth scores on the HANNA dataset. The left matrix represents the performance of the Qwen2.5-7B-Instruct model, while the right matrix represents our EvolvR model.
  • ...and 11 more figures
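
The reward shape described in the Figure 4 caption (1 for a perfect prediction, decaying exponentially with error) can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the use of absolute error, the `decay` rate, and the function names `score_reward` and `mean_reward` are assumptions that the caption does not specify.

```python
import math
from typing import Sequence


def score_reward(predicted: float, target: float, decay: float = 1.0) -> float:
    """Reward of 1.0 for an exact match, decaying exponentially with error.

    Assumption: error is measured as the absolute difference between the
    predicted score and the reference score, and `decay` is a hypothetical
    rate parameter (the caption gives neither detail).
    """
    if decay <= 0:
        raise ValueError("decay must be positive")
    return math.exp(-decay * abs(predicted - target))


def mean_reward(
    predictions: Sequence[float],
    targets: Sequence[float],
    decay: float = 1.0,
) -> float:
    """Average reward over a batch, the kind of quantity tracked during training."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    rewards = [score_reward(p, t, decay) for p, t in zip(predictions, targets)]
    return sum(rewards) / len(rewards)
```

Under these assumptions, a prediction that matches the reference score earns exactly 1, and each additional point of error multiplies the reward by `exp(-decay)`, so larger mistakes are penalized smoothly rather than with a hard cutoff.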