Robust Planning with Compound LLM Architectures: An LLM-Modulo Approach

Atharva Gundawar, Karthik Valmeekam, Mudit Verma, Subbarao Kambhampati

TL;DR

A technical evaluation of a compound LLM architecture, the LLM-Modulo framework, in which sound verifiers check every LLM output before it is returned. The system therefore never returns an incorrect output, and every output it does return is guaranteed correct, something previous techniques have not been able to claim.

Abstract

Previous work has attempted to boost Large Language Model (LLM) performance on planning and scheduling tasks through a variety of prompt engineering techniques. While these methods can work within the distributions tested, they are neither robust nor predictable. This limitation can be addressed through compound LLM architectures, where LLMs work in conjunction with other components to ensure reliability. In this paper, we present a technical evaluation of one such architecture, the LLM-Modulo framework. In this framework, an LLM is paired with a complete set of sound verifiers that validate its output and re-prompt the LLM whenever the output fails verification. This approach ensures that the system never returns an incorrect output, and therefore that every output it does return is guaranteed correct, something previous techniques have not been able to claim. Our results, evaluated across four scheduling domains, demonstrate significant performance gains with the LLM-Modulo framework using various models. Additionally, we explore modifications to the base configuration of the framework and assess their impact on overall system performance.

Paper Structure

This paper contains 31 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: LLM-Modulo framework for scheduling domains. The loop begins with (1) the prompt generator building the prompt from the problem specification. Then (2) the prompt is passed to the LLM, which returns a response. (3) The LLM response is sent to the format critic, and if the format critic approves, (4) it is sent to the constraint critics, which check its validity. (5) If any critic disapproves of the response, the critic feedback is sent to the metacontroller. Then (6) the metacontroller consolidates the critics' evaluations and backprompts the LLM. (7) If all the critics approve of the response, the framework returns it as a valid solution.
  • Figure 2: Performance of models with direct prompting (lighter colors) and with LLM-Modulo (darker colors) on Trip Planning across subsets with varying complexity. LLM-Modulo is indicated by LM.
  • Figure 3: Effect of including context from previous iterations on model performance in the Calendar Scheduling domain using the GPT-4o-mini model. All values are rounded to the nearest integer.
  • Figure 4: Comparison of Full Feedback, Binary Feedback, and First Feedback across different iterations in the calendar scheduling domain using GPT-4o-mini. Results are reported as performance percentages for each feedback type.
  • Figure 5: Accuracy improvement of GPT-4o-mini Modulo on calendar scheduling with zero-shot Chain-of-Thought prompting (blue), compared to baseline (red).
  • ...and 3 more figures
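
The loop in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `llm_modulo`, the toy meeting-scheduling task, the critic functions, the feedback wording, and the scripted stand-in for the LLM are all hypothetical names introduced here. It shows only the control flow: generate, check format, check constraints, consolidate feedback, backprompt, and return a response only when every critic approves.

```python
from typing import Callable, List, Optional

# A critic returns None to approve a response, or a feedback string to reject it.
Critic = Callable[[str], Optional[str]]

# Hypothetical toy task: pick a start hour for a one-hour meeting.
BUSY_HOURS = {9, 10}          # hours already booked (assumption for the demo)
DAY_START, DAY_END = 9, 17    # working day, 24-hour clock (assumption)


def format_critic(response: str) -> Optional[str]:
    """Step 3: the response must be a bare non-negative integer."""
    if response.strip().isdigit():
        return None
    return "Reply with a single integer start hour."


def within_hours(response: str) -> Optional[str]:
    """Constraint critic: the meeting must fit inside the working day."""
    hour = int(response)
    if DAY_START <= hour < DAY_END:
        return None
    return f"Hour {hour} is outside {DAY_START}-{DAY_END}."


def no_conflict(response: str) -> Optional[str]:
    """Constraint critic: the meeting must not overlap a booked hour."""
    hour = int(response)
    return None if hour not in BUSY_HOURS else f"Hour {hour} is already booked."


def llm_modulo(problem: str, llm: Callable[[str], str], fmt: Critic,
               constraints: List[Critic], max_iters: int = 10) -> Optional[str]:
    """Return a response approved by all critics, or None if the budget runs out."""
    base_prompt = f"Problem: {problem}\nAnswer:"            # step 1
    prompt = base_prompt
    for _ in range(max_iters):
        response = llm(prompt)                              # step 2
        fmt_feedback = fmt(response)                        # step 3
        if fmt_feedback is not None:
            feedback = [fmt_feedback]
        else:                                               # step 4
            feedback = [m for c in constraints if (m := c(response)) is not None]
        if not feedback:
            return response                                 # step 7
        # Steps 5-6: consolidate the critics' feedback and backprompt the LLM.
        prompt = (f"{base_prompt}\nPrevious answer: {response}\n"
                  "Feedback:\n" + "\n".join(feedback) + "\nAnswer:")
    return None  # no verified answer within the iteration budget


def make_scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Stand-in for a real model: replays canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


result = llm_modulo("Schedule a one-hour meeting.",
                    make_scripted_llm(["noon", "10", "11"]),
                    format_critic, [within_hours, no_conflict])
print(result)  # prints 11
```

In this sketch the correctness guarantee comes entirely from the critics: a response is returned only if every critic approves it, and when the iteration budget is exhausted the function returns `None` rather than an unverified answer.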