Creative Beam Search: LLM-as-a-Judge For Improving Response Generation

Giorgio Franceschelli, Mirco Musolesi

TL;DR

Creative Beam Search (CBS) addresses the gap between intentional human creativity and current LLM generation by combining a generation phase based on Diverse Beam Search (DBS) with a validation phase based on LLM-as-a-Judge. Grounded in Amabile's creativity framework, CBS uses DBS to produce diverse candidates and a self-evaluation step to select the preferred one. A qualitative study with 31 graduate students shows that CBS output is often perceived as more creative than standard sampling output, and that the self-evaluation step improves selection; for some prompts, however, the two outputs were too similar for participants to choose between them. Limitations include the lack of true intentionality, potential biases in self-evaluation, and computational cost. Future work aims to broaden the candidate pool and to optimize prompt structures for better co-creative performance.

Abstract

Large language models are revolutionizing several areas, including artificial creativity. However, the process of generation in machines profoundly diverges from that observed in humans. In particular, machine generation is characterized by a lack of intentionality and of an underlying creative process. We propose a method called Creative Beam Search that uses Diverse Beam Search and LLM-as-a-Judge to perform response generation and response validation. The results of a qualitative experiment show how our approach can provide better output than standard sampling techniques. We also show that the response validation step is a necessary complement to the response generation step.

Paper Structure (13 sections, 3 figures, 1 table, 2 algorithms)

Figures (3)

  • Figure 1: The Creative Beam Search method. Given a user prompt (step 0), DBS samples $K$ candidate solutions from a pre-trained language model (step 1). Then, $K$ evaluative prompts are composed by altering the order of the candidates, and each is passed to the model as input (step 2). The candidate that receives the most preferences is returned as the final output.
  • Figure 2: The interface presented to the end-users during our experiment. After the user enters a prompt with a creative request, two options are shown in random order: the CBS output and the standard sampling output. The user is then asked to indicate which of the two is more creative in their opinion (or whether the two options are too similar to decide).
  • Figure 3: Percentage of end-users' preferences, compared between cases where the CBS output is equal to the DBS output and cases where it is not.
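
The validation step described in the Figure 1 caption (step 2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the wording of the evaluative prompt, the use of cyclic rotations to "alter the order" of the candidates, and the tie-breaking rule are all hypothetical. The judge is passed in as a plain callable so that any model call could be substituted; the toy judge here only stands in for an LLM.

```python
from collections import Counter
from typing import Callable, Sequence


def rotations(items: Sequence[str]) -> list[list[str]]:
    """Return the K cyclic rotations of a K-item sequence (assumed reordering scheme)."""
    items = list(items)
    return [items[i:] + items[:i] for i in range(len(items))]


def build_eval_prompt(user_prompt: str, ordered: Sequence[str]) -> str:
    """Compose one evaluative prompt; the wording is illustrative only."""
    options = "\n".join(f"{n}. {text}" for n, text in enumerate(ordered, start=1))
    return (
        f"Request: {user_prompt}\n"
        f"Candidate responses:\n{options}\n"
        "Which candidate is the most creative? Answer with its number."
    )


def select_by_votes(
    user_prompt: str,
    candidates: Sequence[str],
    judge: Callable[[str], int],
) -> str:
    """Ask the judge once per ordering and return the candidate with the most votes.

    `judge` takes an evaluative prompt and returns the 1-based number of the
    option it prefers. Ties fall back to the earliest candidate in the original
    order (an assumption; the source does not specify a tie-breaking rule).
    """
    votes = Counter()
    for ordered in rotations(candidates):
        choice = judge(build_eval_prompt(user_prompt, ordered))
        if 1 <= choice <= len(ordered):  # ignore malformed answers
            votes[ordered[choice - 1]] += 1
    # max() keeps the first maximal element, which implements the tie rule above.
    return max(candidates, key=lambda c: votes[c])


def longest_option_judge(prompt: str) -> int:
    """Toy stand-in for an LLM call: picks the longest numbered option."""
    options = [
        line.split(". ", 1) for line in prompt.splitlines() if line[:1].isdigit()
    ]
    number, _ = max(options, key=lambda pair: len(pair[1]))
    return int(number)


if __name__ == "__main__":
    candidates = ["a rose", "a rose is a rose is a rose", "thorn"]
    print(select_by_votes("Write a line of poetry", candidates, longest_option_judge))
```

Presenting each candidate in every position once means that a judge with a purely positional preference (for example, one that always answers "1") gives each candidate exactly one vote, so position alone cannot decide the outcome. In this sketch such a tie is resolved in favor of the first candidate in the original order.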