CoFineLLM: Conformal Finetuning of LLMs for Language-Instructed Robot Planning

Jun Wang; Yevgeniy Vorobeychik; Yiannis Kantaros

TL;DR

CoFineLLM tackles the unreliability of LLM-based planners in long-horizon robot tasks by integrating conformal prediction into the training loop. It introduces a loss that combines standard supervision with a CP-based regularizer and uses a calibration-driven threshold $\delta$ to simulate conformalization during finetuning, aided by LoRA and curriculum learning. Empirically, it achieves consistent reductions in prediction-set size and user-help rates while preserving CP coverage, and demonstrates robustness in out-of-distribution hardware scenarios. This approach enables more autonomous language-guided robotic planning with fewer human interventions and reliable probabilistic guarantees.

Abstract

Large Language Models (LLMs) have recently emerged as planners for language-instructed agents, generating sequences of actions to accomplish natural language tasks. However, their reliability remains a challenge, especially in long-horizon tasks, since they often produce overconfident yet wrong outputs. Conformal Prediction (CP) has been leveraged to address this issue by wrapping LLM outputs into prediction sets that contain the correct action with a user-defined confidence. When the prediction set is a singleton, the planner executes that action; otherwise, it requests help from a user. This has led to LLM-based planners that can ensure plan correctness with a user-defined probability. However, because LLMs are trained in an uncertainty-agnostic manner, without awareness of prediction sets, they tend to produce unnecessarily large sets, particularly at higher confidence levels. The result is frequent human intervention, which limits autonomous deployment. To address this, we introduce CoFineLLM (Conformal Finetuning for LLMs), the first CP-aware finetuning framework for LLM-based planners that explicitly reduces prediction-set size and, in turn, the need for user interventions. We evaluate our approach on multiple language-instructed robot planning problems and show consistent improvements over uncertainty-aware and uncertainty-agnostic finetuning baselines in terms of prediction-set size and help rates. Finally, we demonstrate robustness of our method to out-of-distribution scenarios in hardware experiments.
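
The execute-or-ask rule described in the abstract can be illustrated with standard split conformal prediction over a single planning step. The sketch below is not the paper's implementation: the nonconformity score `1 - p`, the function names, the action labels, and the toy probabilities are all assumptions made for illustration, and it does not include the finetuning loss or the multi-step treatment.

```python
import math


def calibrate_threshold(cal_true_probs, alpha):
    """Return the split-CP threshold for target coverage 1 - alpha.

    cal_true_probs: probability the model gave to the correct action on
    each held-out calibration example (hypothetical toy data here).
    The nonconformity score is assumed to be 1 - p(correct action).
    """
    scores = sorted(1.0 - p for p in cal_true_probs)
    n = len(scores)
    # Finite-sample corrected rank of the quantile.
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every action.
        return 1.0
    return scores[k - 1]


def prediction_set(action_probs, qhat):
    """Keep every action whose nonconformity score is within the threshold."""
    return [a for a, p in action_probs.items() if 1.0 - p <= qhat]


def act_or_ask(action_probs, qhat):
    """Execute when the set is a singleton; otherwise request user help."""
    pset = prediction_set(action_probs, qhat)
    if len(pset) == 1:
        return ("execute", pset[0])
    return ("ask_help", pset)


# Toy calibration data and two toy planning steps (illustrative values only).
cal_true_probs = [0.9, 0.8, 0.95, 0.3, 0.7, 0.85, 0.5]
qhat = calibrate_threshold(cal_true_probs, alpha=0.125)

confident_step = {"pick_apple": 0.85, "pick_cup": 0.10, "move_left": 0.05}
ambiguous_step = {"pick_apple": 0.45, "pick_cup": 0.40, "move_left": 0.15}

print(act_or_ask(confident_step, qhat))
print(act_or_ask(ambiguous_step, qhat))
```

In this toy setup, a single poorly predicted calibration example is enough to push the threshold up, so the ambiguous step yields a two-action set and triggers a help request. This is the effect the abstract attributes to uncertainty-agnostic training: large sets at high confidence levels, which CP-aware finetuning aims to shrink.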
