VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft

Honghao Fu, Junlong Ren, Qi Chai, Deheng Ye, Yujun Cai, Hao Wang

TL;DR

Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.

Abstract

Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.

Paper Structure

This paper contains 23 sections, 6 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: The motivation of injecting external knowledge. (a) LLMs may generate incorrect dependencies due to a lack of domain-specific knowledge in the virtual world; (b) Injecting external knowledge enables LLMs to generate more accurate responses.
  • Figure 2: The overview of VistaWise. VistaWise is based on an LLM and incorporates three graph-based processes: (1) text-modal graph construction, integrating external textual domain knowledge via a lightweight KG to establish factual dependencies and mitigate hallucinations; (2) cross-modal graph construction, embedding real-time visual information from a dedicated object detection model into the text-modal graph, forming a vision-text graph with dynamic visual attributes; (3) task-specific information retrieval, utilizing a retrieval-based pooling strategy to extract task-related information from the vision-text graph, guided by both the task-specific prompt and the real-time visual attributes of the graph. Furthermore, VistaWise comprises two interaction modules: (i) a desktop-level skill library, allowing the agent to act in the Minecraft desktop client via mouse-and-keyboard (MNK) operations, with action parameters generated autonomously by the LLM; (ii) a memory stack, storing and querying decision history to support reasoning. At each timestep, the agent makes decisions and executes actions based on the prompt, retrieved information, memory, and skill library, altering the game environment to advance the task.
  • Figure 3: Retrieval-based pooling. It first employs path searching pooling (PSP) to retain paths from the "Player" node to the task-specific "Target" node in the KG. Subsequently, entity matching pooling (EMP) preserves entities referenced in the task prompt and those with visual attributes in the dynamic vision-text graph. The pooled graph is textualized and input to the LLM, providing the agent with factual dependencies and real-time visual information.
  • Figure 4: Ablation study on information retrieval strategies. (Left) False Positive Rate (FPR), the proportion of redundant information in the retrieved results. (Right) False Negative Rate (FNR), the proportion of information that should have been retrieved but was missed. Lower FPR and FNR indicate better retrieval. "Similarity-based", "EMP", and "PSP" refer to the similarity-based strategy, entity matching pooling, and path searching pooling, respectively, while "EMP-PSP" and "PSP-EMP" denote their execution order.
  • Figure 5: The tokens consumed by our proposed agent to successfully achieve various goals. "VistaWise" is our standard framework proposed in Sec. \ref{method}, while "with FA" and "with V" indicate the addition of full graph attributes (Table \ref{AEA}) and the use of visual input (Table \ref{LLMbackbone}) to the standard framework, respectively.
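
The retrieval-based pooling described in the Figure 3 caption, and the FPR/FNR measures defined in the Figure 4 caption, can be illustrated with a small Python sketch. It is not the authors' implementation: the toy graph, the node names, the `bbox` attribute, and all function names are hypothetical. Two assumptions are made where the captions are silent: PSP is approximated by keeping every edge that lies on some walk from "Player" to the target (exact for acyclic graphs), and EMP is read as a filter applied to the nodes PSP retains, using plain substring matching against the prompt.

```python
from collections import deque


def reachable(adj, start):
    """Return every node reachable from `start` by following `adj` (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def path_search_pool(edges, source, target):
    """PSP sketch: keep edges lying on some walk from source to target."""
    forward, backward = {}, {}
    for u, v in edges:
        forward.setdefault(u, []).append(v)
        backward.setdefault(v, []).append(u)
    # A node survives if the source can reach it AND it can reach the target.
    keep = reachable(forward, source) & reachable(backward, target)
    return {(u, v) for u, v in edges if u in keep and v in keep}


def entity_match_pool(nodes, prompt, visual_attrs):
    """EMP sketch (assumed to be a filter): keep nodes that are named in the
    prompt or that currently carry visual attributes."""
    text = prompt.lower()
    return {n for n in nodes if n.lower() in text or n in visual_attrs}


def retrieval_error_rates(retrieved, relevant):
    """FPR = redundant share of what was retrieved;
    FNR = missed share of what should have been retrieved."""
    fpr = len(retrieved - relevant) / len(retrieved) if retrieved else 0.0
    fnr = len(relevant - retrieved) / len(relevant) if relevant else 0.0
    return fpr, fnr


# Hypothetical dependency graph: an edge u -> v means "u is needed to get v".
edges = [
    ("Player", "log"), ("log", "planks"), ("planks", "stick"),
    ("planks", "crafting_table"), ("planks", "wooden_pickaxe"),
    ("stick", "wooden_pickaxe"), ("crafting_table", "wooden_pickaxe"),
    ("Player", "dirt"), ("wooden_pickaxe", "cobblestone"),
    ("cobblestone", "stone_pickaxe"),
]
# Hypothetical detector output: entities currently visible on screen.
visual_attrs = {"log": {"bbox": (412, 230, 470, 300)},
                "dirt": {"bbox": (10, 400, 90, 480)}}
prompt = "Craft a wooden_pickaxe."

psp_edges = path_search_pool(edges, "Player", "wooden_pickaxe")
psp_nodes = {n for edge in psp_edges for n in edge}
pooled_nodes = entity_match_pool(psp_nodes, prompt, visual_attrs)

# Made-up ground truth, only to exercise the two metrics.
fpr, fnr = retrieval_error_rates(
    retrieved={"log", "dirt", "wooden_pickaxe"},
    relevant={"log", "planks", "wooden_pickaxe"},
)
```

In this toy run PSP drops "dirt", "cobblestone", and "stone_pickaxe" because none of them lies between "Player" and the target. Under the filter reading, EMP then keeps only "log" (visible on screen) and "wooden_pickaxe" (named in the prompt). If EMP instead adds its matches to the PSP result, the two sets would be merged with a union rather than filtered.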