COS(M+O)S: Curiosity and RL-Enhanced MCTS for Exploring Story Space via Language Models

Tobias Materzok

TL;DR

COS(M+O)S introduces a System 2–inspired storytelling framework that combines Monte Carlo Tree Search with a step-level value model and Odds Ratio Preference Optimization to iteratively explore and refine plot expansions generated by a 3B LLM. The method leverages a curiosity-driven, inverted-U Surprisal index and a coherence-focused value model to guide search and fine-tune the policy, achieving substantial gains over naive single-pass decoding and narrowing the gap to larger models on short-story tasks. Human and GPT-based evaluations corroborate that MCTS-discovered high-value trajectories are preferred and perceived as higher quality, albeit with modest absolute performance due to model size and data limitations. The work highlights a scalable path for improving creative writing with smaller models by embedding deliberate search, evaluation, and preference-aligned learning, while outlining practical constraints and future directions for larger backbones and richer evaluators.

Abstract

We present COS(M+O)S, a System 2-inspired framework for open-ended plot development that systematically explores the vast space of possible story expansions, enabling a 3B-parameter language model to approach the plot quality of a 70B model on select short-story tasks. The method accomplishes this by combining Monte Carlo Tree Search (MCTS), guided by a step-level value model that rewards moderate surprisal (curiosity) while penalizing incoherence, and Odds Ratio Preference Optimization (ORPO) to fine-tune the policy on high-value plot expansions. This iterative reinforcement learning loop systematically explores multiple candidate plot branches, backpropagates quality signals, and adapts the policy for faster convergence, notably shifting the policy from puzzle-based Chain-of-Thought to more character-driven storytelling. In small-scale tests with short-story prompts, 67%-77% of participants favored COS(M+O)S's highest-rated expansions over lower-rated ones, suggesting that our learned value function aligns with human preferences. GPT-4o ratings further show that COS(M+O)S surpasses naive single-pass decoding from Llama 3.2 3B by 0.59 SD, coming within 0.06 SD of Llama 3.1 70B (no significant difference, p=0.93). Pairwise comparisons with o1 place COS(M+O)S 1.5 SD above the 3B baseline and find no statistically significant gap from 70B. Nevertheless, absolute story quality remains modest, constrained by the small model's capacity and limited training data.
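
To make the search loop described in the abstract more concrete, the following is a minimal sketch of value-guided MCTS with backpropagation of quality signals. It is not the authors' implementation: the UCT selection rule, the exploration constant `c`, and the names `Node`, `propose`, `evaluate`, and `search` are assumptions chosen for illustration. In the paper's setting, `propose` would stand in for the policy LLM suggesting plot expansions and `evaluate` for the step-level value model.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A partial story state in the search tree (hypothetical structure)."""
    state: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    q: float = 0.0  # running mean of values backed up through this node


def uct(child, parent_visits, c=1.0):
    """Standard UCT score; unvisited children are tried first."""
    if child.visits == 0:
        return math.inf
    return child.q + c * math.sqrt(math.log(parent_visits) / child.visits)


def select(node):
    """Descend from the root, always following the child with the best UCT score."""
    while node.children:
        node = max(node.children, key=lambda ch: uct(ch, node.visits))
    return node


def expand(node, propose):
    """Attach one child per candidate action and return the first new child."""
    for action in propose(node.state):
        node.children.append(Node(state=node.state + action, parent=node))
    return node.children[0] if node.children else node


def backpropagate(node, value):
    """Update visit counts and running-mean Q on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.q += (value - node.q) / node.visits
        node = node.parent


def search(root_state, propose, evaluate, iterations):
    """Run a fixed number of select / expand / evaluate / backpropagate cycles."""
    root = Node(root_state)
    for _ in range(iterations):
        leaf = select(root)
        child = expand(leaf, propose)
        backpropagate(child, evaluate(child.state))
    return root


if __name__ == "__main__":
    # Toy stand-ins: two fixed "actions" and a value equal to the share of "a".
    toy_root = search("", lambda s: ["a", "b"], lambda s: s.count("a") / len(s), 50)
    best = max(toy_root.children, key=lambda ch: ch.q)
    print(best.state, round(best.q, 3), best.visits)
```

After such a search, high-$Q$ actions could be paired against low-$Q$ siblings to form preference data for fine-tuning. How the paper constructs those pairs is not specified in this excerpt.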

Paper Structure

This paper contains 67 sections, 9 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: MCTS-based story exploration and summarized (Chain-of-Thought) actions. Each edge denotes a candidate action proposed by the policy model, and each node is a partial story state. High-value expansions (orange) are explored more deeply. After search concludes, actions with high $Q$ serve as evidence for ORPO fine-tuning.
  • Figure 2: Heatmap of average F1 score for separating stories via the curiosity index, using group-aware stratified repeated k-fold cross-validation. The optimal surprisal peak shifts with model size: the smaller, "less read" SmolLM-360M (Allal et al., 2024) exhibits a higher "optimal" surprisal level (around 10 bits), reflecting that it finds these stories more surprising than the larger Phi-3.5 3B model does.
  • Figure 3: Progression of maximum estimated plot quality $V_{\max}^{(\mathrm{final})}$ as a function of iterations $k$ and compute (PF-days) in Round 0 (base policy), Round 1 and Round 2 (fine-tuned policies). Individual experiments are plotted alongside the average fit and its 90% confidence interval. We estimate PF-days by multiplying the run time by 35 teraflops for a GPU operating at an assumed 30% utilization.
  • Figure 4: Log-linear fits (including 90% CI) of the average maximum estimated plot quality $V_{\max}^{(\mathrm{final})}$ across Rounds 0, 1, and 2 on a log scale of MCTS iterations $k$ (and corresponding PF-days of compute).
  • Figure 5: Visualization of the MCTS search tree of an experiment in Round 0 after 8 (a, top) and 80 (b, bottom) iterations. Dots represent story states, with the vertical axis denoting search depth and the horizontal axes serving only for layout. Colors correspond to action-value estimates $Q(s,a)$ (blue = lower, red = higher). The black line highlights the highest-value trajectory $V_{\max}^{(\mathrm{final})}$.
  • ...and 10 more figures
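
The compute estimate mentioned in the Figure 3 caption can be worked through as simple arithmetic. The sketch below assumes that the 35 teraflops figure is the effective throughput already reflecting the 30% utilization; the caption can also be read as a peak rate still to be scaled by 0.3, in which case the result would be multiplied by 0.3. The function name `pf_days` and its parameters are hypothetical.

```python
SECONDS_PER_DAY = 86_400
PETAFLOP_PER_SECOND = 1e15  # one petaflop/s, in FLOP per second


def pf_days(runtime_seconds, effective_flops_per_second=35e12):
    """Convert GPU run time into petaflop/s-days of compute.

    Assumes `effective_flops_per_second` is the sustained rate
    (35 TFLOP/s here, read as already including utilization).
    """
    total_flop = runtime_seconds * effective_flops_per_second
    return total_flop / (PETAFLOP_PER_SECOND * SECONDS_PER_DAY)


if __name__ == "__main__":
    # One full day of run time at 35 TFLOP/s corresponds to 0.035 PF-days.
    print(pf_days(24 * 3600))
```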