An Empirical Study of Conformal Prediction in LLM with ASP Scaffolds for Robust Reasoning
Navdeep Kaur, Lachlan McPheat, Alessandra Russo, Anthony G Cohn, Pranava Madhyastha
TL;DR
The paper addresses the robustness of multi-step spatial reasoning in open-weight LLMs by combining Conformal Language Modelling (CLM) with Answer Set Programming (ASP) to produce conformal sets of ASP programs for StepGame. CLM provides a finite-sample guarantee on the expected loss of the conformal set, formalized as $\mathbb{P}(\mathbb{E}[L(\mathcal{C}(X))]\leq \varepsilon) \geq 1-\delta$. On StepGame, CLM improves accuracy by at least 20 percentage points over the baselines, and an LLM-as-Judge metric improves performance further. Diverse calibration sets help on tasks of moderate length but not on very long reasoning chains (e.g., 15-hop). The work demonstrates a practical, interpretable neuro-symbolic reasoning pipeline with statistical guarantees that improves the reliability of structured outputs generated by LLMs.
Abstract
In this paper, we examine the use of Conformal Language Modelling (CLM) alongside Answer Set Programming (ASP) to enhance the performance of standard open-weight LLMs on complex multi-step reasoning tasks. Using the StepGame dataset, which requires spatial reasoning, we apply CLM to generate sets of ASP programs from an LLM, providing statistical guarantees on the correctness of the outputs. Experimental results show that CLM significantly outperforms baseline models that use standard sampling methods, achieving substantial accuracy improvements across different levels of reasoning complexity. Additionally, the LLM-as-Judge metric enhances CLM's performance, especially in assessing whether ASP outputs are structurally and logically correct. However, calibrating CLM with diverse calibration sets did not improve generalizability to tasks requiring many more reasoning steps, indicating limitations in handling more complex tasks.
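To make the form of the guarantee $\mathbb{P}(\mathbb{E}[L(\mathcal{C}(X))]\leq \varepsilon) \geq 1-\delta$ concrete, the following is a minimal sketch of how a set-construction threshold could be calibrated on held-out data. It is a simplified illustration and not the procedure used in the paper: it assumes a single confidence threshold `lam`, a 0/1 loss that equals 1 when the set contains no correct program, and a Hoeffding bound with a Bonferroni correction over a finite grid. The names `set_loss` and `calibrate_threshold`, and the `(score, is_correct)` data layout, are hypothetical.

```python
import math


def set_loss(candidates, lam):
    """0/1 loss of the set {candidates with score >= lam}.

    `candidates` is a list of (score, is_correct) pairs for one input
    (hypothetical layout). The loss is 1.0 when no retained candidate
    is correct, and 0.0 otherwise.
    """
    return 0.0 if any(ok for score, ok in candidates if score >= lam) else 1.0


def calibrate_threshold(calibration, grid, epsilon, delta):
    """Pick the largest threshold whose risk is certified to be <= epsilon.

    For each lam in `grid`, the empirical risk on the calibration data is
    inflated by a Hoeffding margin. Splitting delta evenly over the grid
    (Bonferroni) makes all certificates hold jointly with probability
    at least 1 - delta. Returns None when no threshold can be certified.
    """
    n = len(calibration)
    margin = math.sqrt(math.log(len(grid) / delta) / (2 * n))
    valid = []
    for lam in grid:
        risk = sum(set_loss(c, lam) for c in calibration) / n
        if risk + margin <= epsilon:
            valid.append(lam)
    # A larger threshold keeps fewer candidates, so prefer the largest valid one.
    return max(valid) if valid else None


# Toy calibration data: the correct program always has score 0.5,
# and an incorrect one has score 0.9.
toy = [[(0.9, False), (0.5, True)] for _ in range(200)]
chosen = calibrate_threshold(toy, grid=[0.2, 0.5, 0.8], epsilon=0.2, delta=0.1)
```

With too few calibration examples the Hoeffding margin alone exceeds $\varepsilon$ and the function returns `None`, which reflects the finite-sample nature of the guarantee: it can only be certified when enough calibration data is available.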
