Table of Contents
Fetching ...

A Solver-in-the-Loop Framework for Improving LLMs on Answer Set Programming for Logic Puzzle Solving

Timo Pierre Schrader, Lukas Lange, Tobias Kaminski, Simon Razniewski, Annemarie Friedrich

TL;DR

This work tackles the difficulty of generating correct Answer Set Programming (ASP) code with large language models by closing the loop with an ASP solver. It introduces a solver-in-the-loop framework that (i) samples partial ASP encodings, (ii) uses solver feedback to label selections for supervised fine-tuning, and (iii) applies solver-guided best-of-N sampling at test time with a carefully designed reward to improve robustness. The approach yields consistent gains across two prompting settings and two datasets (LogicPuzzles and GridPuzzles) for multiple open-weight LLMs, including notable improvements for smaller models after SFT and for GPT-4.1-mini under inference-time guidance. A key contribution is the combination of silver-standard training data generation from solver feedback and solver-based inference to steer ASP code generation without extra human annotations, enabling more reliable domain-specific coding with LLMs.

Abstract

The rise of large language models (LLMs) has sparked interest in coding assistants. While general-purpose programming languages are well supported, generating code for domain-specific languages remains a challenging problem for LLMs. In this paper, we focus on the LLM-based generation of code for Answer Set Programming (ASP), a particularly effective approach for finding solutions to combinatorial search problems. The effectiveness of LLMs in ASP code generation is currently hindered by the limited number of examples seen during their initial pre-training phase. In this paper, we introduce a novel ASP-solver-in-the-loop approach for solver-guided instruction-tuning of LLMs to addressing the highly complex semantic parsing task inherent in ASP code generation. Our method only requires problem specifications in natural language and their solutions. Specifically, we sample ASP statements for program continuations from LLMs for unriddling logic puzzles. Leveraging the special property of declarative ASP programming that partial encodings increasingly narrow down the solution space, we categorize them into chosen and rejected instances based on solver feedback. We then apply supervised fine-tuning to train LLMs on the curated data and further improve robustness using a solver-guided search that includes best-of-N sampling. Our experiments demonstrate consistent improvements in two distinct prompting settings on two datasets.

A Solver-in-the-Loop Framework for Improving LLMs on Answer Set Programming for Logic Puzzle Solving

TL;DR

This work tackles the difficulty of generating correct Answer Set Programming (ASP) code with large language models by closing the loop with an ASP solver. It introduces a solver-in-the-loop framework that (i) samples partial ASP encodings, (ii) uses solver feedback to label selections for supervised fine-tuning, and (iii) applies solver-guided best-of-N sampling at test time with a carefully designed reward to improve robustness. The approach yields consistent gains across two prompting settings and two datasets (LogicPuzzles and GridPuzzles) for multiple open-weight LLMs, including notable improvements for smaller models after SFT and for GPT-4.1-mini under inference-time guidance. A key contribution is the combination of silver-standard training data generation from solver feedback and solver-based inference to steer ASP code generation without extra human annotations, enabling more reliable domain-specific coding with LLMs.

Abstract

The rise of large language models (LLMs) has sparked interest in coding assistants. While general-purpose programming languages are well supported, generating code for domain-specific languages remains a challenging problem for LLMs. In this paper, we focus on the LLM-based generation of code for Answer Set Programming (ASP), a particularly effective approach for finding solutions to combinatorial search problems. The effectiveness of LLMs in ASP code generation is currently hindered by the limited number of examples seen during their initial pre-training phase. In this paper, we introduce a novel ASP-solver-in-the-loop approach for solver-guided instruction-tuning of LLMs to addressing the highly complex semantic parsing task inherent in ASP code generation. Our method only requires problem specifications in natural language and their solutions. Specifically, we sample ASP statements for program continuations from LLMs for unriddling logic puzzles. Leveraging the special property of declarative ASP programming that partial encodings increasingly narrow down the solution space, we categorize them into chosen and rejected instances based on solver feedback. We then apply supervised fine-tuning to train LLMs on the curated data and further improve robustness using a solver-guided search that includes best-of-N sampling. Our experiments demonstrate consistent improvements in two distinct prompting settings on two datasets.

Paper Structure

This paper contains 31 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: We use the feedback from an ASP solver to assess LLM-generated ASP statements. (1) We take the problem description and let the LLM output $N$ alternatives for it (generate part of an ASP encoding). We then judge the correctness of these encodings using an ASP solver which checks for errors as well as for the expected answer size. (2) We incrementally combine the constraints of the problem into larger inputs to the LLM and obtain longer partial encodings (check part of an ASP encoding). (3) The ASP solver judges the correctness of each partial encoding containing the base encoding and at least one constraint. (4a) For training, we select instances from a chosen set that corresponds to instances with positive solver feedback. (4b) During inference, where the ground truth is not known, we score every partial encoding using a reward function that captures errors as well as the number of produced answer sets with preference for a lower number of returned answers. Additionally, we implement fallback mechanisms that regenerate encodings or backtrack in the graph if all alternatives were judged with a negative score.
  • Figure 2: Comparison of generated output tokens in relation to accuracy on LogicPuzzles between the DeepSeek Llama-70B, the untrained Llama-3.3 70B as well as the SFT-tuned one in combination with reward-based inference.
  • Figure 3: Matching two types of entities with four subjects each. There are $4 \times 3 \times 2 \times 1 = 4!$ possibilities to find exclusive matchings.
  • Figure 4: Visualization of an $3 \times 4$ grid puzzles that requires $2 = 3 - 1$ independent matchings with $4!$ possibilities in each subgraph.
  • Figure 5: Comparison of success and failure types between the base and SFT trained Llama-3.3 70B.
  • ...and 2 more figures