COS(M+O)S: Curiosity and RL-Enhanced MCTS for Exploring Story Space via Language Models
Tobias Materzok
TL;DR
COS(M+O)S introduces a System 2–inspired storytelling framework that combines Monte Carlo Tree Search with a step-level value model and Odds Ratio Preference Optimization to iteratively explore and refine plot expansions generated by a 3B LLM. The method leverages a curiosity-driven, inverted-U Surprisal index and a coherence-focused value model to guide search and fine-tune the policy, achieving substantial gains over naive single-pass decoding and narrowing the gap to larger models on short-story tasks. Human and GPT-based evaluations corroborate that MCTS-discovered high-value trajectories are preferred and perceived as higher quality, albeit with modest absolute performance due to model size and data limitations. The work highlights a scalable path for improving creative writing with smaller models by embedding deliberate search, evaluation, and preference-aligned learning, while outlining practical constraints and future directions for larger backbones and richer evaluators.
Abstract
We present COS(M+O)S, a System 2-inspired framework for open-ended plot development that systematically explores the vast space of possible story expansions, enabling a 3B-parameter language model to approach the plot quality of a 70B model on select short-story tasks. The method accomplishes this by combining Monte Carlo Tree Search (MCTS), guided by a step-level value model that rewards moderate surprisal (curiosity) while penalizing incoherence, with Odds Ratio Preference Optimization (ORPO), which fine-tunes the policy on high-value plot expansions. This iterative reinforcement learning loop explores multiple candidate plot branches, backpropagates quality signals, and adapts the policy for faster convergence, notably shifting the policy from puzzle-based Chain-of-Thought to more character-driven storytelling. In small-scale tests with short-story prompts, 67%-77% of participants favored COS(M+O)S's highest-rated expansions over lower-rated ones, suggesting that our learned value function aligns with human preferences. GPT-4o ratings further show that COS(M+O)S surpasses naive single-pass decoding from Llama 3.2 3B by 0.59 SD, coming within 0.06 SD of Llama 3.1 70B (no significant difference, p=0.93). Pairwise comparisons with o1 place COS(M+O)S 1.5 SD above the 3B baseline and find no statistically significant gap from 70B. Nevertheless, absolute story quality remains modest, constrained by the small model's capacity and limited training data.
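To make the search loop described above more concrete, the following is a minimal sketch, not the paper's implementation. It assumes a Gaussian-shaped inverted-U curiosity term, a multiplicative combination with a coherence score, and standard UCT selection. All names (`curiosity_score`, `step_value`, `Node`, `uct_select`, `backpropagate`) and all parameter values (`peak`, `width`, `c`) are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


def curiosity_score(surprisal: float, peak: float = 0.5, width: float = 0.2) -> float:
    """Inverted-U reward: 1.0 at `peak`, decaying symmetrically on both sides.

    The Gaussian shape and the default peak/width are assumptions; the
    abstract only states that moderate surprisal is rewarded.
    """
    return math.exp(-((surprisal - peak) ** 2) / (2 * width ** 2))


def step_value(surprisal: float, coherence: float) -> float:
    """Hypothetical step-level value: curiosity gated by coherence in [0, 1]."""
    return coherence * curiosity_score(surprisal)


@dataclass
class Node:
    """One plot expansion in the search tree."""
    text: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)
    visits: int = 0
    value_sum: float = 0.0

    def mean_value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def uct_select(node: Node, c: float = 1.4) -> Node:
    """Return the child with the highest UCT score (unvisited children first)."""
    def score(child: Node) -> float:
        if child.visits == 0:
            return math.inf
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return child.mean_value() + explore
    return max(node.children, key=score)


def backpropagate(node: Optional[Node], value: float) -> None:
    """Add `value` to every node on the path from `node` up to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node = node.parent


# Usage: two candidate expansions of the same prompt.
root = Node("A lighthouse keeper finds a letter.")
bland = Node("He reads it and goes to bed.", parent=root)
twist = Node("The letter is in his own handwriting.", parent=root)
root.children = [bland, twist]

backpropagate(bland, step_value(surprisal=0.05, coherence=0.9))
backpropagate(twist, step_value(surprisal=0.5, coherence=0.9))
best = uct_select(root)  # with equal visit counts, the higher mean value wins
```

In a full loop of this kind, the highest-value and lowest-value sibling expansions would supply the chosen/rejected pairs for preference fine-tuning with ORPO; that step is omitted here.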
