Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

Sipeng Zheng; Jiazheng Liu; Yicheng Feng; Zongqing Lu

TL;DR

This work addresses a gap in open-world embodied AI by integrating visual perception with a pre-trained LLM to form Steve-Eye, a large multimodal model capable of multimodal perception, knowledge grounding, and skill planning. It introduces an 850K open-world instruction-following dataset spanning multimodal perception, foundational knowledge, and skill-related interactions, and employs a two-stage instruction-tuning strategy that first aligns visual features with language and then tunes the model end to end. On three open-world benchmarks in Minecraft, namely Environmental Visual Captioning (ENV-VC), Foundational Knowledge QA (FK-QA), and Skill Prediction and Planning (SPP), Steve-Eye outperforms text-only baselines and demonstrates multimodal generation and planning abilities, with ablations illustrating the value of the visual encoder and data components. The results indicate Steve-Eye's potential to improve open-world agents in real-world settings, with future work aimed at broader environments and deployment scenarios.

Abstract

Recent studies have presented compelling evidence that large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world, which marks an initial step toward versatile robotics. However, these efforts tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to "a blindfolded text-based game." Consequently, LLM-based agents frequently encounter challenges in intuitively comprehending their surroundings and producing responses that are easy to understand. In this paper, we propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation. Steve-Eye integrates the LLM with a visual encoder which enables it to process visual-text inputs and generate multimodal feedback. In addition, we use a semi-automatic strategy to collect an extensive dataset comprising 850K open-world instruction pairs, empowering our model to encompass three essential functions for an agent: multimodal perception, foundational knowledge base, and skill prediction and planning. Lastly, we develop three open-world evaluation benchmarks, then carry out extensive experiments from a wide range of perspectives to validate our model's capability to strategically act and plan. Codes and datasets will be released.

Paper Structure (24 sections, 3 equations, 14 figures, 7 tables)

Introduction
Related Work
Open-world Embodied Agents with LLMs
Large Multimodal Models (LMMs)
Methodology
Open-World Instruction-Following Dataset
Model Architecture
Training
Experiments
Experimental Setup
Environmental Visual Captioning (ENV-VC)
Foundational Knowledge Question Answering (FK-QA)
Skill Prediction and Planning (SPP)
Conclusion
Appendix
...and 9 more sections

Figures (14)

Figure 1: (a) An LLM-based agent's feedback is uncontrollable due to the uncertainty of the input textual prompt, while visual cues can help the agent generate feedback; (b) a text-only driven agent often finds it difficult to produce intuitive feedback that humans can easily understand.
Figure 2: Multimodal perception
Figure 3: Icons and recipes
Figure 4: Illustration of Steve-Eye: a large multimodal model designed to seamlessly process both visual and language inputs. Steve-Eye excels in acquiring fundamental knowledge of the world it lives in, understanding the nuances of its surroundings, and generating executable plans to complete a wide array of open-ended tasks. Furthermore, Steve-Eye responds to user instructions through either visual or text-based cues, enhancing the convenience and flexibility of human-AI interaction.
Figure 5: Snapshots of a qualitative example, illustrating how Steve-Eye completes the task of "crafting a stone axe with a wooden pickaxe." Our model generates a skill plan at each interaction round and selects the top skill from the plan list for execution.
...and 9 more figures
