Monte Carlo Tree Search for Recipe Generation using GPT-2
Karan Taneja, Richard Segal, Richard Goodwin
TL;DR
The paper addresses unreliable long-range coherence in LLM-generated recipes by introducing RecipeMC, a decoding framework that uses Monte Carlo Tree Search over a fine-tuned GPT-2 to enforce soft constraints via handcrafted rewards. The approach combines $Z=20$ search iterations, PUCB-based token selection with $PUCB(i) = Q(i) + c \cdot p(x_i|x_{1:i-1}) \frac{\sqrt{N}}{n_i+1}$, top-$k$ expansion, and top-$p$ simulations to produce ingredient lists and instructions that align with the recipe name and content. Across the Name→Ingr and Name+Ingr→Inst tasks, RecipeMC outperforms top-$p$ baselines on automatic metrics (coherence, $F_1$, ROUGE, BLEU, perplexity) and is judged more human-like in a Recipe Turing Test, with preferences around 55% for ingredients and 62% for instructions. The work shows that MCTS with simple reward signals can improve structured text generation without additional training, which allows flexible, constraint-driven recipe design and could support interactive collaboration with chefs as well as applications exposed via APIs.
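As a minimal sketch of the PUCB selection rule quoted above, the snippet below scores each candidate token and picks the highest-scoring one. It assumes that $N$ is the visit count of the parent node and $n_i$ the visit count of child $i$; the function names `pucb_scores` and `select_child`, the default `c=1.0`, and the example numbers are hypothetical and are not taken from the paper.

```python
import math

def pucb_scores(q_values, priors, child_visits, parent_visits, c=1.0):
    """Compute Q(i) + c * p_i * sqrt(N) / (n_i + 1) for every child i.

    Assumption: parent_visits plays the role of N in the formula.
    """
    sqrt_n = math.sqrt(parent_visits)
    return [
        q + c * p * sqrt_n / (n + 1)
        for q, p, n in zip(q_values, priors, child_visits)
    ]

def select_child(q_values, priors, child_visits, parent_visits, c=1.0):
    """Return the index of the child with the largest PUCB score."""
    scores = pucb_scores(q_values, priors, child_visits, parent_visits, c)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical example: two candidate tokens under one parent node.
q_values = [0.5, 0.2]      # mean reward observed so far for each child
priors = [0.1, 0.9]        # language-model probability of each token
child_visits = [3, 0]      # n_i for each child
parent_visits = 4          # N for the parent

scores = pucb_scores(q_values, priors, child_visits, parent_visits)
best = select_child(q_values, priors, child_visits, parent_visits)
```

In this example the unvisited child wins despite its lower value estimate, because its high prior and zero visit count make the exploration term dominate. This illustrates how the rule trades off the reward estimate against the language model's own preferences.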
Abstract
Automatic food recipe generation methods provide a creative tool for chefs to explore and create new and interesting culinary delights. Given their recent success, large language models (LLMs) have the potential to create new recipes that meet individual preferences, satisfy dietary constraints, and adapt to what is in your refrigerator. Existing research has shown that LLMs can be finetuned to generate realistic-sounding recipes. However, on close examination, these generated recipes often fail to meet basic requirements, such as including chicken as an ingredient in chicken dishes. In this paper, we propose RecipeMC, a text generation method using GPT-2 that relies on Monte Carlo Tree Search (MCTS). RecipeMC allows us to define reward functions that place soft constraints on text generation and thus improve the credibility of the generated recipes. Our results show that, when compared with real recipes, human evaluators prefer recipes generated with RecipeMC more often than recipes generated with other baseline methods.
