LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence

Zhuoling Li, Xiaogang Xu, Zhenhua Xu, SerNam Lim, Hengshuang Zhao

TL;DR

This work tackles long-horizon embodied AI under compute constraints by addressing reward vanishment in reinforcement learning. It introduces LARM, a lightweight auto-regressive model (<5B parameters) that directly outputs the next action and is trained with referee reinforcement learning, in which a giant LLM (GPT-4) provides immediate auxiliary feedback to guide learning. Combining a decoder-only backbone (TinyLLaVA-3.1B) fine-tuned with LoRA, CLIP-based multimodal inputs, and a GPT-4 referee enables a single model to accomplish diverse Minecraft tasks across the MineDojo and Mineflayer environments, including the first reported harvest of enchanted diamond equipment. The results show strong generalization, higher task success than prior methods, and practical online inference speed, pointing to a viable path toward efficient open-world embodied intelligence without per-task specialization or massive compute.

Abstract

Recent embodied agents are primarily built on reinforcement learning (RL) or large language models (LLMs). RL agents are efficient to deploy but can perform only a few tasks. By contrast, giant LLM agents (often more than 1000B parameters) generalize well but demand enormous computing resources. In this work, we combine the advantages of both while avoiding their drawbacks by applying the proposed referee RL to our large auto-regressive model (LARM). Specifically, LARM is built upon a lightweight LLM (fewer than 5B parameters) and directly outputs the next action to execute rather than text. We show mathematically that classic RL feedback vanishes in long-horizon embodied exploration, and we introduce a giant-LLM-based referee to counter this reward vanishment while training LARM. In this way, LARM learns to complete diverse open-world tasks without human intervention. Notably, LARM successfully harvests enchanted diamond equipment in Minecraft, which demands significantly longer decision-making chains than the highest achievements of the best prior methods.
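
The reward-vanishing claim in the abstract can be illustrated with a small numerical sketch. This is not the paper's derivation; it only shows the common intuition that a single success reward at the end of a long episode contributes almost nothing to the discounted return seen at the first step, while a dense per-step auxiliary signal does not fade in the same way. The function names, the discount factor of 0.99, the constant referee score, and the weight of 0.1 are all assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one episode."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def sparse_episode(horizon):
    """Episode whose only reward is 1.0 at the final step (hypothetical task)."""
    return [0.0] * (horizon - 1) + [1.0]


def with_referee(rewards, referee_scores, weight):
    """Add a weighted per-step auxiliary score to the environment reward.

    In this sketch the scores are plain numbers; how a referee LLM would
    produce them is outside the scope of the example.
    """
    return [r + weight * s for r, s in zip(rewards, referee_scores)]


gamma = 0.99  # assumed discount factor

# The same terminal reward, seen from step 0 of a short and a long episode.
short_ret = discounted_return(sparse_episode(10), gamma)
long_ret = discounted_return(sparse_episode(1000), gamma)

# Dense auxiliary feedback at every step of the long episode (assumed scores).
shaped = with_referee(sparse_episode(1000), [1.0] * 1000, weight=0.1)
shaped_ret = discounted_return(shaped, gamma)

print(short_ret, long_ret, shaped_ret)
```

With these assumed values, the terminal reward of the 1000-step episode is discounted by a factor of `0.99 ** 999`, so the learning signal at early steps is close to zero unless some immediate feedback is added.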

Paper Structure (14 sections, 6 equations, 3 figures, 4 tables, 1 algorithm)

This paper contains 14 sections, 6 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison among agents based on RL, LLM, and LARM. RL agents are usually task-specialized, and LLM agents are computationally expensive to deploy. By contrast, the LARM agent is efficient and generalizable, and it also performs better: LARM is the first method to obtain enchanted diamond equipment in Minecraft.
  • Figure 2: The overall pipeline of our method. We parametrize the actor $\pi_a$ and critic $\pi_c$ using a single LARM model with two separate prediction heads, i.e., the action head and the critic head. We train LARM with our proposed referee RL algorithm, which uses both environment feedback and a referee-generated auxiliary reward to guide the optimization of LARM.
  • Figure 3: Further examples of LARM behavior, including traveling a long distance to find a village, building a nether portal and then entering the Nether, and multiple agents collaborating to combat zombies.
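
The two-head arrangement described in the Figure 2 caption (one shared model feeding an action head and a critic head) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the shared trunk here is a single tanh layer standing in for the TinyLLaVA backbone, and the class name `TwoHeadModel`, the layer sizes, and the random initialization are all hypothetical.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: one output per weight row, computed as w . x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


class TwoHeadModel:
    """Shared trunk with an action head (actor) and a critic head (value)."""

    def __init__(self, in_dim, hidden_dim, num_actions, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # Toy stand-in for the shared backbone (assumption, not TinyLLaVA).
        self.w_trunk, self.b_trunk = init(hidden_dim, in_dim), [0.0] * hidden_dim
        # Action head: one logit per discrete action.
        self.w_action, self.b_action = init(num_actions, hidden_dim), [0.0] * num_actions
        # Critic head: a single scalar value estimate.
        self.w_critic, self.b_critic = init(1, hidden_dim), [0.0]

    def forward(self, x):
        # Both heads read the same hidden features from the shared trunk.
        h = [math.tanh(v) for v in linear(self.w_trunk, self.b_trunk, x)]
        action_probs = softmax(linear(self.w_action, self.b_action, h))
        value = linear(self.w_critic, self.b_critic, h)[0]
        return action_probs, value


model = TwoHeadModel(in_dim=8, hidden_dim=16, num_actions=5)
probs, value = model.forward([0.5] * 8)
print(probs, value)
```

Sharing the trunk means the actor and critic reuse one set of features, so only the two small heads are specific to each role; in this sketch that is the only property being demonstrated.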