QuickLAP: Quick Language-Action Preference Learning for Autonomous Driving Agents

Jordan Abi Nader, David Lee, Nathaniel Dennler, Andreea Bobu

TL;DR

QuickLAP tackles online reward learning for autonomous driving by fusing physical corrections and natural language in a principled Bayesian framework. It treats language as a probabilistic observation over latent rewards and uses a dual-LLM pipeline to produce a feature-attention mask $r$ and a reward shift $\mu$ with confidence $m$. These are combined with a conditional prior and a Boltzmann-like physical likelihood to yield a closed-form MAP update. The key result is a Kalman-like update $\hat{\theta}_i^{t+1}=\hat{\theta}_i^t+\frac{\sigma_{L,i}^2\Delta\Phi_i+\mu_i^t}{\Lambda_{\mathrm{prior},i}\sigma_{L,i}^2+1}$ that adapts to the reliability of language and the relevance of features, enabling fast, robust, real-time learning. Empirically, QuickLAP reduces reward-inference error in simulated driving scenarios by over 70% relative to physical-only and heuristic multimodal baselines, and participants in a 15-person user study rated it as more understandable and collaborative. Code is available at the project repository. This work advances multimodal human-robot interaction by providing a general framework for online, interpretable preference learning that leverages language to disambiguate grounded physical feedback.
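To make the per-feature update concrete, here is a minimal Python sketch of the formula as stated in the TL;DR. It is an illustration, not the authors' implementation: the function name `quicklap_map_update` and its argument names are hypothetical, and it takes $\sigma_{L,i}^2$ and $\Lambda_{\mathrm{prior},i}$ as given inputs, since how they are derived from the mask $r$ and confidence $m$ is not covered in this summary.

```python
from typing import List, Sequence


def quicklap_map_update(
    theta: Sequence[float],         # current estimates, theta_hat_i^t
    delta_phi: Sequence[float],     # feature differences, Delta Phi_i
    mu: Sequence[float],            # language-derived reward shifts, mu_i^t
    sigma_l_sq: Sequence[float],    # language noise variances, sigma_{L,i}^2 (assumed given)
    lambda_prior: Sequence[float],  # prior precisions, Lambda_{prior,i} (assumed given)
) -> List[float]:
    """Apply the per-feature update from the TL;DR formula.

    theta_i <- theta_i + (sigma_L_i^2 * dPhi_i + mu_i) / (Lambda_prior_i * sigma_L_i^2 + 1)
    """
    n = len(theta)
    if not all(len(v) == n for v in (delta_phi, mu, sigma_l_sq, lambda_prior)):
        raise ValueError("all inputs must have the same length")

    updated = []
    for th, dphi, shift, s2, lam in zip(theta, delta_phi, mu, sigma_l_sq, lambda_prior):
        # The denominator is at least 1 for non-negative inputs, so no division by zero.
        step = (s2 * dphi + shift) / (lam * s2 + 1.0)
        updated.append(th + step)
    return updated


if __name__ == "__main__":
    # Illustrative values only (not from the paper).
    # Feature 0: very low language variance, so the step follows mu.
    # Feature 1: very high language variance, so the step approaches dPhi / Lambda_prior.
    new_theta = quicklap_map_update(
        theta=[0.0, 0.0],
        delta_phi=[1.0, 1.0],
        mu=[0.8, 0.0],
        sigma_l_sq=[0.0, 1e9],
        lambda_prior=[4.0, 4.0],
    )
    print(new_theta)
```

Read directly off the formula, the two limits explain the "Kalman-like" behavior: as $\sigma_{L,i}^2 \to 0$ the step reduces to the language shift $\mu_i^t$, and as $\sigma_{L,i}^2 \to \infty$ it approaches $\Delta\Phi_i / \Lambda_{\mathrm{prior},i}$, a physical-only step scaled by the prior.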

Abstract

Robots must learn from both what people do and what they say, but either modality alone is often incomplete: physical corrections are grounded but ambiguous in intent, while language expresses high-level goals but lacks physical grounding. We introduce QuickLAP: Quick Language-Action Preference learning, a Bayesian framework that fuses physical and language feedback to infer reward functions in real time. Our key insight is to treat language as a probabilistic observation over the user's latent preferences, clarifying which reward features matter and how physical corrections should be interpreted. QuickLAP uses Large Language Models (LLMs) to extract reward feature attention masks and preference shifts from free-form utterances, which it integrates with physical feedback in a closed-form update rule. This enables fast, real-time, and robust reward learning that handles ambiguous feedback. In a semi-autonomous driving simulator, QuickLAP reduces reward learning error by over 70% compared to physical-only and heuristic multimodal baselines. A 15-participant user study further validates our approach: participants found QuickLAP significantly more understandable and collaborative, and preferred its learned behavior over baselines. Code is available at https://github.com/MIT-CLEAR-Lab/QuickLAP.

Paper Structure

This paper contains 35 sections, 20 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: (Top) User Study Setup. Participants controlled the virtual car using a gaming steering wheel. (Bottom) QuickLAP fuses physical corrections and natural language for improved reward learning. (Bottom-Left) A human's physical correction (blue) leads to an update that incorrectly changes the reward on multiple correlated features. (Bottom-Middle) QuickLAP combines this physical signal with concurrent language input, using an LLM to clarify intent and produce a more accurate reward update prioritizing cone avoidance. (Bottom-Right) In future scenarios, QuickLAP generates safer trajectories (orange) than the baseline (blue).
  • Figure 2: Example scenes from our four experimental scenarios for semi-autonomous driving: (a) Cone, (b) Cone + Puddle, (c) Cone + Puddle + Car (CPC-3), and (d) 4-lane Cone + Puddle + Car (CPC-4).
  • Figure 3: Comparison of adaptation methods across different environments for 4 interventions per episode. (a) Bars represent the Normalized MSE for each environment, averaged over 6 language inputs. Error bars indicate the mean $\pm$ standard error of the mean (SEM) calculated across the language inputs. (b) Solid lines represent the average NMSE. Shaded regions indicate the mean $\pm$ SEM, with the SEM calculated across the 4 environments. Lower NMSE indicates better performance.
  • Figure 4: User study results. All error bars represent standard error. (a) Average ratings for Understandability, Ease of Use, Predictability, and Collaborativity. Higher values are better. (b) Average ranking of each algorithm. Three corresponds to the most preferred algorithm, and one corresponds to the least preferred algorithm. (c) Normalized MSE between the learned vehicle behavior and the optimal vehicle behavior. Lower values are better.
  • Figure 5: Graphical model for QuickLAP. The robot optimizes with current reward parameters $\theta^t$, which generate a candidate trajectory $\xi_R$. The human has latent true reward parameters $\theta$. An attention mask $r^t$ (at time $t$) selects the feature subspace currently attended, introducing a latent shift $\mu$ between human and robot rewards. Given $\xi_R$, the human may provide a physical intervention $\xi_H$ under $\theta$, and a language utterance $l$ that directly informs $\mu$. Shaded nodes are observed $\{\theta^t, \xi_R, \xi_H, l\}$; unshaded nodes are latent $\{\theta, r, \mu\}$.
  • ...and 1 more figures