From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control

Yide Shentu, Philipp Wu, Aravind Rajeswaran, Pieter Abbeel

TL;DR

This work introduces Latent Codes as Bridges (LCB), a hierarchical control architecture that inserts a learnable <ACT> latent token to bridge a multimodal LLM and a fast low-level policy. By extracting the <ACT> embedding as a high-level latent goal and projecting it into the policy, LCB enables end-to-end fine-tuning without eroding the LLM’s language embeddings. The approach is validated on Language Table and CALVIN, where LCB achieves superior reasoning and long-horizon task performance compared to language-only interfaces and GPT-4V baselines. The results demonstrate effective integration of high-level language reasoning with low-level embodied control, offering a scalable path toward flexible, language-guided robotics.

Abstract

Hierarchical control for robotics has long been plagued by the need for a well-defined interface layer to communicate between high-level task planners and low-level policies. With the advent of LLMs, language has been emerging as a prospective interface layer. However, language as an interface has several limitations. Not all tasks can be decomposed into steps that are easily expressible in natural language (e.g. performing a dance routine). Further, a language interface makes end-to-end finetuning on embodied data challenging due to domain shift and catastrophic forgetting. We introduce our method, Learnable Latent Codes as Bridges (LCB), as an alternative architecture to overcome these limitations. LCB uses a learnable latent code to act as a bridge between LLMs and low-level policies. This enables LLMs to flexibly communicate goals in the task plan without being entirely constrained by language limitations. Additionally, it enables end-to-end finetuning without destroying the embedding space of word tokens learned during pre-training. Through experiments on Language Table and CALVIN, two common language-based benchmarks for embodied agents, we find that LCB outperforms baselines (including those with GPT-4V) that leverage pure language as the interface layer on tasks that require reasoning and multi-step behaviors.
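
The bridging step described above (read the feature at the <ACT> token, then project it into the policy) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the token id `ACT_TOKEN_ID`, the toy hidden states, the dimensions, and the use of a single linear projection are all assumptions made for illustration.

```python
import random

ACT_TOKEN_ID = 32000  # hypothetical vocabulary id for the added <ACT> token


def extract_act_embedding(token_ids, last_hidden_states):
    """Return the last-layer hidden state at the final <ACT> position."""
    positions = [i for i, t in enumerate(token_ids) if t == ACT_TOKEN_ID]
    if not positions:
        raise ValueError("sequence contains no <ACT> token")
    return last_hidden_states[positions[-1]]


def make_projection(in_dim, out_dim, seed=0):
    """Create a small random linear layer (assumed form of the projection)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(embedding, weights, bias):
    """Apply y = W x + b to map the LLM feature into the policy's goal space."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]


# Toy sequence: ordinary tokens followed by <ACT> and an end token.
token_ids = [1, 15, 27, ACT_TOKEN_ID, 2]
# Stand-in for the LLM's last-layer features: one 8-dim vector per token.
hidden = [[float(i + j) for j in range(8)] for i in range(len(token_ids))]

act_embedding = extract_act_embedding(token_ids, hidden)
W, b = make_projection(in_dim=8, out_dim=4)
latent_goal = project(act_embedding, W, b)  # conditioning input for the policy
```

In this sketch only the feature at the <ACT> position reaches the policy, so gradients from the policy loss would flow back through that one position rather than through generated word tokens, which is the property the abstract relies on for end-to-end finetuning.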

Paper Structure (12 sections, 1 equation, 7 figures, 2 tables)


Figures (7)

  • Figure 1: Illustration of our proposed Latent Codes as Bridges architecture. Given a high-level task description and the observation, a Large Language Model (LLM) generates a textual description of an action and an <ACT> token. The last-layer feature embedding of the <ACT> token serves as a high-level latent goal for the downstream policy network. Our modular hierarchical approach combines the LLM's high-level reasoning with the pre-trained policy's responsive low-level control, addressing the limitations of direct action output by monolithic LLMs. Unlike methods that use a large LLM to directly output agent actions [brohan2023rt2], our approach can run the LLM reasoning and action policy execution loops asynchronously, mirroring human task execution: immediate low-level feedback when interacting with the physical world, and slower, deliberate reasoning for longer-term planning. At test time, the action policy frequently updates actions based on environment changes, while the LLM updates are less frequent, enabling efficient, real-world inference.
  • Figure 2: A high-level architectural comparison of LLM-based hierarchical policies. Predefined skills (left) uses an LLM to call predefined primitives. Language as an interface (middle) uses an LLM to output a simple language command, which is then passed into a language-conditioned policy. LCB (right) uses a latent code as a bridge between the LLM and the low-level policy, facilitating hierarchical control and end-to-end learning.
  • Figure 3: A visualization of the two environments along with exemplar tasks that we train and evaluate on. The top depicts the Language Table environment [Lynch2022-na]. We study reasoning tasks (first trajectory) and long-horizon tasks (second trajectory). The bottom depicts the CALVIN long-horizon benchmark [mees2022calvin], in which the agent must sequentially accomplish tasks.
  • Figure 4: Task success rates on Language Table. The tasks are drawn from the higher-level Language Table tasks from PaLM-E [Driess2023-lm]. LangTable refers to the original Language Table policy [Lynch2022-na]. +LLaVA (frozen) refers to composing the original Language Table policy with a frozen LLaVA model and few-shot prompting. +GPT-4V similarly refers to composing the original policy with GPT-4V. +LLaVA (finetuned) refers to finetuning the LLaVA model on the language portion of our mixture dataset only, then composing it with the policy. Our results show that leveraging LCB is effective on tasks that require additional reasoning and planning. Note that the same model is evaluated on both the long-horizon and reasoning tasks.
  • Figure 5: A comparison of the flow from a high-level language task to the policy for different approaches. (Left) LangTable + GPT-4V requires a prompt to understand the task and desired output format. GPT-4V can provide language reasoning that lets the user introspect the decision process of the language model, but requires additional parsing to extract the relevant language instruction to pass to the policy. (Middle) LangTable + LLaVA (Fine-tuned) fine-tunes the language model to output the exact language instruction as in the training data, effectively acting as a language interface converter. This approach, while effective, removes the chat-like capability from the language model. (Right) LCB fine-tunes the language model with a chat-like interface and an action token. The policy is directly conditioned on the latent feature from the action token provided by the model, enabling effective policy conditioning without losing the chat-like language model interface.
  • ...and 2 more figures
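
The two update rates described in the Figure 1 caption (frequent policy updates, infrequent LLM updates) can be sketched as a simple control loop. This is a single-threaded stand-in for the asynchronous execution the caption describes, not the authors' code: `run_episode`, `replan_every`, and the toy one-dimensional environment are hypothetical names and assumptions introduced for illustration.

```python
def run_episode(high_level, low_level, env_step, obs, num_steps, replan_every):
    """Run the low-level policy every step and the high-level model
    only every `replan_every` steps, reusing the last latent goal between."""
    latent_goal = None
    log = []
    for t in range(num_steps):
        if t % replan_every == 0:
            latent_goal = high_level(obs)      # slow, deliberate update
        action = low_level(obs, latent_goal)   # fast, reactive update
        obs = env_step(obs, action)
        log.append((t, latent_goal, action))
    return obs, log


# Toy stand-ins (assumptions): the "LLM" always asks to reach position 10.0,
# and the "policy" moves toward the goal by at most 1.0 per step.
calls = {"high": 0, "low": 0}


def toy_high_level(obs):
    calls["high"] += 1
    return 10.0


def toy_low_level(obs, goal):
    calls["low"] += 1
    return max(-1.0, min(1.0, goal - obs))


def toy_env_step(obs, action):
    return obs + action


final_obs, log = run_episode(
    toy_high_level, toy_low_level, toy_env_step,
    obs=0.0, num_steps=10, replan_every=5,
)
```

With these toy settings the high-level model is queried twice while the policy acts ten times. A real system would more likely run the two loops in separate threads or processes, so that a slow LLM call never blocks low-level control.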