Learning to Learn Faster from Human Feedback with Language Model Predictive Control

Jacky Liang; Fei Xia; Wenhao Yu; Andy Zeng; Montserrat Gonzalez Arenas; Maria Attarian; Maria Bauza; Matthew Bennice; Alex Bewley; Adil Dostmohamed; Chuyuan Kelly Fu; Nimrod Gileadi; Marissa Giustina; Keerthana Gopalakrishnan; Leonard Hasenclever; Jan Humplik; Jasmine Hsu; Nikhil Joshi; Ben Jyenis; Chase Kew; Sean Kirmani; Tsang-Wei Edward Lee; Kuang-Huei Lee; Assaf Hurwitz Michaely; Joss Moore; Ken Oslund; Dushyant Rao; Allen Ren; Baruch Tabanpour; Quan Vuong; Ayzaan Wahid; Ted Xiao; Ying Xu; Vincent Zhuang; Peng Xu; Erik Frey; Ken Caluwaerts; Tingnan Zhang; Brian Ichter; Jonathan Tompson; Leila Takayama; Vincent Vanhoucke; Izhak Shafran; Maja Mataric; Dorsa Sadigh; Nicolas Heess; Kanishka Rao; Nik Stewart; Jie Tan; Carolina Parada

TL;DR

The paper tackles the challenge of making LLMs more teachable for robot programming by combining fast online adaptation via in-context learning with slow offline improvement through fine-tuning, embodied in Language Model Predictive Control (LMPC). By framing human-robot interactions as a POMDP and training the model to predict future interaction rollouts, LMPC uses model predictive control to search for short, effective teaching trajectories. Across 78 tasks and 5 robot embodiments, fine-tuned LMPC reduces the average number of human corrections from 2.4 to 1.9 and improves teaching success on unseen tasks by 26.9%. It also improves in-context learning of new tasks on unseen robot embodiments and APIs by 31.5%, with additional improvements from top-user conditioning. The approach highlights practical gains in cross-embodiment generalization and real-world deployment potential, while noting computational and multimodal extensions as future work.

Abstract

Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands -- enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for only as long as it fits within the context size of the LLM, and can be forgotten over longer interactions. In this work, we investigate fine-tuning the robot code-writing LLMs to remember their in-context interactions and improve their teachability, i.e., how efficiently they adapt to human inputs (measured by the average number of corrections before the user considers the task successful). Our key observation is that when human-robot interactions are viewed as a partially observable Markov decision process (in which human language inputs are observations, and robot code outputs are actions), then training an LLM to complete previous interactions is training a transition dynamics model -- one that can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments -- improving non-expert teaching success rates of unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9. Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning new tasks on unseen robot embodiments and APIs by 31.5%. See videos, code, and demos at: https://robot-teaching.github.io/.
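
To make the MPC idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `sample_rollout` callable that stands in for the fine-tuned LLM acting as a dynamics model: given the chat history, it returns one predicted future session (robot code outputs interleaved with predicted user feedback) and whether that session ends in success. The names `Step`, `Rollout`, `select_next_action`, and `make_toy_sampler` are illustrative assumptions. The selection rule keeps the successful rollout with the fewest steps and executes only its first action, which is the receding-horizon pattern.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, List, Optional


@dataclass
class Step:
    code: str      # predicted robot code output (the action)
    feedback: str  # predicted user reply to that code (the observation)


@dataclass
class Rollout:
    steps: List[Step]
    success: bool  # whether the predicted session ends in success


def select_next_action(
    history: List[str],
    sample_rollout: Callable[[List[str]], Rollout],
    num_samples: int = 8,
) -> Optional[str]:
    """Sample predicted futures and return the first action of the
    shortest successful one (None if no sampled rollout succeeds)."""
    best: Optional[Rollout] = None
    for _ in range(num_samples):
        rollout = sample_rollout(history)
        if not rollout.success or not rollout.steps:
            continue  # discard predicted failures and empty rollouts
        if best is None or len(rollout.steps) < len(best.steps):
            best = rollout
    # Receding horizon: execute only the first action, then replan
    # after the real user responds.
    return best.steps[0].code if best is not None else None


def make_toy_sampler(canned: List[Rollout]) -> Callable[[List[str]], Rollout]:
    """Hypothetical stand-in for the LLM: cycles through canned rollouts."""
    source = cycle(canned)
    return lambda history: next(source)


canned = [
    Rollout([Step("reward_v1", "move higher"), Step("reward_v2", "good")], True),
    Rollout([Step("reward_v3", "good")], True),
    Rollout([Step("reward_v4", "wrong"), Step("reward_v5", "still wrong")], False),
]
next_code = select_next_action(["user: sit down"], make_toy_sampler(canned), num_samples=3)
print(next_code)  # reward_v3
```

In this toy run the one-step successful rollout wins over the two-step one, so its first action is returned. In a real system the sampler would be an LLM call and the loop would rerun after each actual user message.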

Paper Structure (28 sections, 19 figures, 16 tables)

Introduction
Related Work
Language Model Predictive Control
Problem Statement
Fast Adaptation with In-Context Learning
Slow Adaptation with Model Fine-Tuning
Experiments
Data Collection and Evaluation
Robot Embodiments and Tasks
Compared Methods
Experiment Results
Discussions
Appendix
Data Collection and Evaluation Details
Additional Results
...and 13 more sections

Figures (19)

Figure 1: Code-writing large language models (LLMs) present opportunities for non-experts to teach robots new tasks with language -- enabled by fast adaptation via in-context learning (left). In this work, we fine-tune the underlying LLMs to further accelerate fast adaptation and improve their teachability (right). Results with human-robot interactions from non-experts teaching 5 robot embodiments on 78 tasks (gray) show that our framework (middle$^*$) can identify top performing users (purple), and leverage their interactions (only 14% of task coverage) to drive LLM performance improvements for all users (blue) -- measured in terms of teaching success rates on unseen tasks, responsiveness to user feedback, and number of user corrections. Experiments show that these improvements generalize to new robot embodiments and APIs.
Figure 2: Our chat interface (left) allows non-experts to use language to teach robots new behaviors (visualized in simulation). Our LLM responds with reward code, to drive real-time motion control of a simulated or real robot. Statistics (right) show that base model data meets expectations: successful teaching sessions take fewer chat turns than failures, and task success rates correlate with fewer chat turns ($r=-0.85$) and higher good rating rates (i.e., responsiveness to feedback, $r=0.92$).
Figure 3: Given a dataset of users teaching robots new tasks with language (represented as text inputs and code outputs from online in-context learning -- left), LMPC-Rollouts is trained to predict subsequent inputs and outputs conditioned on the current chat history (middle), and uses MPC (receding horizon control) for inference-time search to return the next best action (with fewest expected corrections before success). LMPC-Skip is an alternate variant that is trained to directly predict the last action (right). Both LMPC variants accelerate fast robot adaptation via in-context learning.
Figure 4: Our fine-tuned LLMs with LMPC-Rollouts and LMPC-Skip improve the teachability of the base model (PaLM 2-S) and outperform a RAG (Lewis et al., 2020) baseline across all embodiments. LMPC-Skip overfits to train tasks (left), while LMPC-Rollouts generalizes better (i.e., more teachable and responsive to feedback) on unseen test tasks (right) for multi-turn sessions (with more than one chat turn).
Figure 5: Tasks evaluated in the real-world Mobile Manipulator and Robot Dog.
...and 14 more figures
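
The Figure 3 caption distinguishes two training targets, which can be illustrated with a small sketch. This is an assumption-laden illustration rather than the paper's data pipeline: the `USER:`/`ROBOT:` text format, the helper names `rollout_examples` and `skip_examples`, and the choice to emit one example per chat-history prefix are all hypothetical. A session is taken to be a list of (user input, robot code) turns from a teaching chat.

```python
from typing import List, Tuple

Turn = Tuple[str, str]      # (user_input, robot_code)
Example = Tuple[str, str]   # (prompt, target)


def _render(turns: List[Turn]) -> str:
    """Serialize completed turns in a hypothetical plain-text chat format."""
    return "".join(f"USER: {u}\nROBOT: {c}\n" for u, c in turns)


def rollout_examples(session: List[Turn]) -> List[Example]:
    """LMPC-Rollouts style: from the current chat history, predict the
    rest of the session (next code, then later inputs and outputs)."""
    examples = []
    for t, (user_input, code) in enumerate(session):
        prompt = _render(session[:t]) + f"USER: {user_input}\nROBOT: "
        target = f"{code}\n" + _render(session[t + 1:])
        examples.append((prompt, target))
    return examples


def skip_examples(session: List[Turn]) -> List[Example]:
    """LMPC-Skip style: from the current chat history, directly predict
    the last action of the session."""
    final_code = session[-1][1]
    examples = []
    for t, (user_input, _) in enumerate(session):
        prompt = _render(session[:t]) + f"USER: {user_input}\nROBOT: "
        examples.append((prompt, final_code))
    return examples


session = [("sit", "r1"), ("lower", "r2")]
print(rollout_examples(session)[0])
print(skip_examples(session)[0])
```

Both builders share the same prompts and differ only in the target: the rollout target spells out the remaining interaction, which is what allows inference-time search over predicted futures, while the skip target jumps straight to the final code.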
