CoFineLLM: Conformal Finetuning of LLMs for Language-Instructed Robot Planning

Jun Wang, Yevgeniy Vorobeychik, Yiannis Kantaros

TL;DR

CoFineLLM tackles the unreliability of LLM-based planners in long-horizon robot tasks by integrating conformal prediction into the training loop. It introduces a loss that combines standard supervision with a CP-based regularizer and uses a calibration-driven threshold $\delta$ to simulate conformalization during finetuning, aided by LoRA and curriculum learning. Empirically, it achieves consistent reductions in prediction-set size and user-help rates while preserving CP coverage, and demonstrates robustness in out-of-distribution hardware scenarios. This approach enables more autonomous language-guided robotic planning with fewer human interventions and reliable probabilistic guarantees.
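
As a rough illustration of the loss described above, the sketch below adds a smoothed prediction-set-size penalty to the usual cross-entropy term. This summary does not give the exact form of the regularizer, so the sigmoid smoothing, the name `cp_aware_loss`, and the parameters `lam` and `temp` are assumptions made for illustration. The threshold `delta` is treated here as a fixed cutoff on the nonconformity score `1 - p`, whereas the paper derives it from calibration data.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cp_aware_loss(logits, labels, delta, lam=0.5, temp=0.1):
    """Cross-entropy plus a smoothed set-size penalty (illustrative only).

    An action counts as "in the set" when 1 - p <= delta, i.e. p >= 1 - delta.
    The hard indicator is replaced by a sigmoid so the penalty is smooth.
    `lam` and `temp` are hypothetical hyperparameters.
    """
    probs = softmax(logits)
    n = len(labels)
    cross_entropy = -np.log(probs[np.arange(n), labels]).mean()
    # Soft membership of each action in the prediction set.
    membership = 1.0 / (1.0 + np.exp(-(probs - (1.0 - delta)) / temp))
    soft_set_size = membership.sum(axis=1).mean()
    return {
        "total": cross_entropy + lam * soft_set_size,
        "cross_entropy": cross_entropy,
        "soft_set_size": soft_set_size,
    }

# A peaked distribution on the correct action yields a lower total loss
# than a flat one over the same three actions.
peaked = cp_aware_loss(np.array([[5.0, 0.0, 0.0]]), np.array([0]), delta=0.5)
flat = cp_aware_loss(np.array([[0.0, 0.0, 0.0]]), np.array([0]), delta=0.5)
print(peaked["total"] < flat["total"])
```

This version only evaluates the loss in NumPy; an actual finetuning loop would express the same terms in an autodiff framework so that gradients reach the LoRA parameters.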

Abstract

Large Language Models (LLMs) have recently emerged as planners for language-instructed agents, generating sequences of actions to accomplish natural language tasks. However, their reliability remains a challenge, especially in long-horizon tasks, since they often produce overconfident yet wrong outputs. Conformal Prediction (CP) has been leveraged to address this issue by wrapping LLM outputs into prediction sets that contain the correct action with a user-defined confidence. When the prediction set is a singleton, the planner executes that action; otherwise, it requests help from a user. This has led to LLM-based planners that can ensure plan correctness with a user-defined probability. However, because LLMs are trained in an uncertainty-agnostic manner, without awareness of prediction sets, they tend to produce unnecessarily large sets, particularly at higher confidence levels. This results in frequent human interventions and limits autonomous deployment. To address this, we introduce CoFineLLM (Conformal Finetuning for LLMs), the first CP-aware finetuning framework for LLM-based planners that explicitly reduces prediction-set size and, in turn, the need for user interventions. We evaluate our approach on multiple language-instructed robot planning problems and show consistent improvements over uncertainty-aware and uncertainty-agnostic finetuning baselines in terms of prediction-set size and help rates. Finally, we demonstrate robustness of our method to out-of-distribution scenarios in hardware experiments.
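
The singleton-or-ask-for-help rule described above can be sketched with standard split conformal prediction. This is a minimal illustration, not the paper's implementation: the nonconformity score `1 - p`, the function names, and the example probabilities are all assumptions.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    `cal_scores` are nonconformity scores of the correct actions on held-out
    calibration data; `alpha` is the allowed error rate (coverage 1 - alpha).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every action.
        return float("inf")
    return sorted(cal_scores)[k - 1]

def prediction_set(action_probs, q_hat):
    """Keep every action whose score 1 - p does not exceed the threshold."""
    return [a for a, p in action_probs.items() if 1.0 - p <= q_hat]

def decide(action_probs, q_hat):
    """Execute a singleton set; otherwise hand the candidates to the user."""
    candidates = prediction_set(action_probs, q_hat)
    if len(candidates) == 1:
        return ("execute", candidates[0])
    return ("ask_help", candidates)

# Hypothetical calibration scores and planner outputs.
cal = [0.10, 0.20, 0.30, 0.40, 0.50, 0.55, 0.60, 0.70, 0.90]
q_hat = conformal_threshold(cal, alpha=0.25)  # 8th smallest of 9 scores

confident = {"go forward": 0.80, "turn left": 0.15, "pick up": 0.05}
unsure = {"go forward": 0.45, "turn left": 0.40, "pick up": 0.15}
print(decide(confident, q_hat))
print(decide(unsure, q_hat))
```

With these numbers the first call returns a singleton and executes it, while the second keeps two actions and asks for help. A model whose probability mass is spread across actions therefore triggers more help requests at the same threshold, which is the behaviour CP-aware finetuning aims to reduce.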

Paper Structure

This paper contains 15 sections, 5 equations, 2 figures, 5 tables, 2 algorithms.

Figures (2)

  • Figure 1: Example environment from the BabyAI-Text simulator. The agent (red triangle) operates in a grid world with colored keys, balls, boxes, and walls, and receives an NL mission (e.g., “pick up the yellow ball”). The simulator provides textual descriptions listing objects and their relative positions (e.g., “You see a yellow ball 1 step left”) for decision-making.
  • Figure 2: Hardware evaluation in a physical environment with out-of-distribution mission scenarios (sampled from $\mathcal{D}' \neq \mathcal{D}$). The robot receives the NL instruction “put the fire hydrant next to the red car.” The left panel shows the $3 \times 5$ grid abstraction used for planning, where each cell represents a discrete navigation step. Using this representation, the LLM planner generates the action plan $\tau$, which directs the robot to turn, navigate to the hydrant, pick it up, move to the red car, and place the hydrant next to it.