Foundation Models for Logistics: Toward Certifiable, Conversational Planning Interfaces

Yunhao Yang, Neel P. Bhatt, Christian Ellis, Samuel Li, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu

TL;DR

The paper tackles the challenge of safe, interpretable logistics planning under uncertainty by introducing Vision-Language Logistics (VLL) agents that couple natural-language dialogue with real-time perceptual grounding and formal verification. A key innovation is the uncertainty-aware intent verification loop, which provides a probabilistic guarantee $\hat{p}(y_t|z_t)$ based on latent-space distances $d_t$ to class centroids and calibration using $F_C$, enabling proactive clarifications when needed. The authors develop a three-stage VLL architecture, including perception, grounding to $r_t$ in PDDL, and a symbolic verifier, plus uncertainty-guided refinement using Direct Preference Optimization (DPO) and TextGrad prompts. In a lightweight airlift domain, a backbone model trained on as few as 100 samples, with calibration and refinement, outperforms a 20× larger model in goal classification while halving inference latency, illustrating that structured uncertainty signals and verification can deliver certifiable, user-aligned decisions at operational tempo.

Abstract

Logistics operators, from battlefield coordinators re-routing airlifts ahead of a storm to warehouse managers juggling late trucks, need to make mission-critical decisions. Prevailing methods for logistics planning such as integer programming yield plans that satisfy user-defined logical constraints, assuming an idealized mathematical model of the environment. On the other hand, foundation models lower the intermediate processing barrier by translating natural-language user utterances into executable plans, yet they remain prone to misinterpretations and hallucinations that jeopardize safety and cost. We introduce a Vision-Language Logistics (VLL) agent, built on a neurosymbolic framework that pairs the accessibility of natural-language dialogue with verifiable guarantees on user-objective interpretation. The agent interprets user requests and converts them into structured planning specifications, quantifies the uncertainty of the interpretation, and invokes an interactive clarification loop when the uncertainty exceeds an adaptive threshold. Drawing on a lightweight airlift logistics planning use case as an illustrative case study, we highlight a practical path toward certifiable and user-aligned decision-making for complex logistics. Our lightweight model, fine-tuned on just 100 training samples, surpasses the zero-shot performance of 20x larger models in logistics planning tasks while cutting inference latency by nearly 50%.
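The interpret, quantify, and clarify cycle described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `interpret`, `ask_user`, `clarify_until_confident`, and `max_rounds` are hypothetical names, and the threshold is passed in by the caller because the abstract does not say how it adapts.

```python
from typing import Callable, Tuple


def clarify_until_confident(
    utterance: str,
    interpret: Callable[[str], Tuple[str, float]],  # hypothetical: returns (spec, uncertainty)
    ask_user: Callable[[str], str],                 # hypothetical: returns the user's clarification
    threshold: float,                               # assumed to be supplied by the caller
    max_rounds: int = 3,                            # assumed cap on clarification rounds
) -> Tuple[str, float, int]:
    """Re-query the user while the interpretation is too uncertain."""
    spec, uncertainty = interpret(utterance)
    rounds = 0
    while uncertainty > threshold and rounds < max_rounds:
        answer = ask_user(f"Please clarify your request: {utterance}")
        # Append the clarification and interpret the enriched request again.
        utterance = f"{utterance} {answer}"
        spec, uncertainty = interpret(utterance)
        rounds += 1
    return spec, uncertainty, rounds
```

The loop returns the last specification together with its uncertainty and the number of re-queries, so a caller can still reject a request that stays above the threshold after `max_rounds` attempts.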

Paper Structure

This paper contains 30 sections, 1 theorem, 13 equations, 5 figures, 2 tables.

Key Result

theorem 1

Given a new input, we compute its distance $d_t$ to the nearest centroid $c^*$ and compute the probabilistic guarantee $\hat{p}(y_t|z_t)$. In this guarantee, $\Pr[y_i \ne y_c^*]$ is the number of other-class samples over the total number of samples, and $\Pr[ || z_i - c^* ||_2 \le d_t ]$ is the fraction of samples whose distance to $c^*$ is within $d_t$.
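The two empirical quantities named in the theorem can be estimated from a labeled calibration set, as in the sketch below. It assumes latent vectors are plain tuples and that one centroid per class is already available; the names `nearest_centroid` and `empirical_terms` are hypothetical. It deliberately stops short of combining the terms into the final guarantee, since that formula is not reproduced in this summary.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def nearest_centroid(z_t: Vector, centroids: Dict[str, Vector]) -> Tuple[str, float]:
    """Return the label of the closest centroid c* and the L2 distance d_t to it."""
    label = min(centroids, key=lambda k: math.dist(z_t, centroids[k]))
    return label, math.dist(z_t, centroids[label])


def empirical_terms(
    z_t: Vector,
    samples: List[Tuple[Vector, str]],  # calibration set of (latent vector, class label)
    centroids: Dict[str, Vector],
) -> Dict[str, object]:
    """Estimate Pr[y_i != y_c*] and Pr[||z_i - c*||_2 <= d_t] by counting."""
    label, d_t = nearest_centroid(z_t, centroids)
    c_star = centroids[label]
    n = len(samples)
    # Fraction of calibration samples that belong to a class other than c*'s class.
    p_other = sum(1 for _, y in samples if y != label) / n
    # Fraction of all calibration samples lying within distance d_t of c*.
    p_within = sum(1 for z, _ in samples if math.dist(z, c_star) <= d_t) / n
    return {"label": label, "d_t": d_t, "p_other_class": p_other, "p_within": p_within}
```

For example, with two classes of two samples each and a query that lands near the first centroid, half of the calibration samples are other-class and half fall within $d_t$ of $c^*$.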

Figures (5)

  • Figure 1: Overview of a VLL agent. Language and visual inputs are converted into structured goals, filtered through an uncertainty-aware verifier, and dispatched to symbolic planners.
  • Figure 2: Overview of the uncertainty-aware intent-verification loop.
  • Figure 3: The left plot is a latent space before learning. The middle plot is a learned space produced from a fine-tuned model, where three classes of user goals are separated. The right plot shows the calibration distributions of the three classes estimated via the learned latent space.
  • Figure 4: Comparison on goal classification accuracies and re-query frequencies. The fine-tuned VLL achieves better performance than all other baselines with a lower frequency of re-query, showing the effectiveness of uncertainty-guided fine-tuning in aligning model behavior with user goals.
  • Figure 5: Comparison between the baseline foundation models and our refined VLL agents. "VLL (combined)" refers to the fine-tuned GPT-4o-mini backbone with optimized prompts. The refined VLL with the GPT-4o-mini backbone outperforms 20x larger models such as GPT-5.

Theorems & Definitions (4)

  • definition 1: Distance to Centroid
  • definition 2: Calibration Distribution
  • theorem 1: Latent Distance to Probabilistic Guarantee
  • proof