From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control
Yide Shentu, Philipp Wu, Aravind Rajeswaran, Pieter Abbeel
TL;DR
This work introduces Latent Codes as Bridges (LCB), a hierarchical control architecture that inserts a learnable <ACT> latent token to bridge a multimodal LLM and a fast low-level policy. By extracting the <ACT> embedding as a high-level latent goal and projecting it into the policy, LCB enables end-to-end fine-tuning without eroding the LLM’s language embeddings. The approach is validated on Language Table and CALVIN, where LCB achieves superior reasoning and long-horizon task performance compared to language-only interfaces and GPT-4V baselines. The results demonstrate effective integration of high-level language reasoning with low-level embodied control, offering a scalable path toward flexible, language-guided robotics.
Abstract
Hierarchical control for robotics has long been plagued by the need for a well-defined interface layer to communicate between high-level task planners and low-level policies. With the advent of LLMs, language has been emerging as a prospective interface layer. However, using language as the interface has several limitations. Not all tasks can be decomposed into steps that are easily expressible in natural language (e.g. performing a dance routine). Further, it makes end-to-end finetuning on embodied data challenging due to domain shift and catastrophic forgetting. We introduce our method, Learnable Latent Codes as Bridges (LCB), as an alternative architecture to overcome these limitations. LCB uses a learnable latent code to act as a bridge between LLMs and low-level policies. This enables LLMs to flexibly communicate goals in the task plan without being entirely constrained by language limitations. Additionally, it enables end-to-end finetuning without destroying the embedding space of word tokens learned during pre-training. Through experiments on Language Table and CALVIN, two common language-based benchmarks for embodied agents, we find that LCB outperforms baselines (including those with GPT-4V) that leverage pure language as the interface layer on tasks that require reasoning and multi-step behaviors.
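To make the bridging idea concrete, the following is a minimal NumPy sketch of the data flow described above: locate the <ACT> token in the LLM output, take its hidden state as a latent goal, project it, and condition a low-level policy on it. This is an illustration under stated assumptions, not the authors' implementation. The dimensions, the token id, the single linear projection, the tanh policy head, and all function names are hypothetical, and random arrays stand in for real LLM hidden states and trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes and the token id below are hypothetical placeholders.
HIDDEN_DIM = 16    # LLM hidden size
LATENT_DIM = 8     # size of the latent goal passed to the policy
OBS_DIM = 6        # size of the policy's observation features
ACTION_DIM = 2     # size of the action vector
ACT_TOKEN_ID = 99  # vocabulary id assumed for the <ACT> token


def extract_act_embedding(token_ids, hidden_states, act_token_id=ACT_TOKEN_ID):
    """Return the hidden state at the first <ACT> position in the sequence."""
    positions = np.flatnonzero(token_ids == act_token_id)
    if positions.size == 0:
        raise ValueError("sequence contains no <ACT> token")
    return hidden_states[positions[0]]


def project(act_embedding, weight, bias):
    """Linear map from the LLM hidden space to the policy's latent goal space."""
    return act_embedding @ weight + bias


def policy(observation, latent_goal, weight):
    """Toy low-level policy: one linear layer over [observation, latent goal]."""
    features = np.concatenate([observation, latent_goal])
    return np.tanh(features @ weight)


# Stand-ins for an LLM forward pass: token ids and per-token hidden states.
token_ids = np.array([5, 12, 7, ACT_TOKEN_ID, 2])
hidden_states = rng.normal(size=(token_ids.size, HIDDEN_DIM))

# Stand-ins for learned parameters (these would be trained end to end).
proj_weight = 0.1 * rng.normal(size=(HIDDEN_DIM, LATENT_DIM))
proj_bias = np.zeros(LATENT_DIM)
policy_weight = 0.1 * rng.normal(size=(OBS_DIM + LATENT_DIM, ACTION_DIM))

observation = rng.normal(size=OBS_DIM)
latent_goal = project(
    extract_act_embedding(token_ids, hidden_states), proj_weight, proj_bias
)
action = policy(observation, latent_goal, policy_weight)
print(action.shape)
```

Because the goal is carried by the hidden state of one added token rather than by decoded text, gradients from the policy's action loss can in principle flow back through the projection into the LLM without the policy depending on the word-token embeddings directly.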
