MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Junpeng Yue; Xinrun Xu; Börje F. Karlsson; Zongqing Lu

TL;DR

The paper addresses a weakness of retrieval-augmented embodied agents: retrieving past trajectories by surface similarity ignores whether a trajectory actually helps with the task at hand. It proposes MART, which fine-tunes an MLLM retriever on preference data collected from environment interactions, and introduces Trajectory Abstraction, which summarizes long trajectories into their essential milestones. A Bradley-Terry-based scoring head ranks trajectories so that the agent grounds its actions in the most effective past experiences. Experiments in AI2-THOR and LEGENT show consistent, significant improvements on unseen tasks, pointing to a practical, scalable paradigm for task-grounded, retrieval-augmented embodied agents.

Abstract

MLLM agents demonstrate potential for complex embodied tasks by retrieving multimodal task-relevant trajectory data. However, current retrieval methods primarily focus on surface-level similarities of textual or visual cues in trajectories, neglecting their effectiveness for the specific task at hand. To address this issue, we propose a novel method, MLLM As ReTriever (MART), which enhances the performance of embodied agents by utilizing interaction data to fine-tune an MLLM retriever based on preference learning, such that the retriever fully considers the effectiveness of trajectories and prioritizes them for unseen tasks. We also introduce Trajectory Abstraction, a mechanism that leverages MLLMs' summarization capabilities to represent trajectories with fewer tokens while preserving key information, enabling agents to better comprehend milestones in the trajectory. Experimental results across various environments demonstrate our method significantly improves task success rates in unseen scenes compared to baseline methods. This work presents a new paradigm for multimodal retrieval in embodied agents, by fine-tuning a general-purpose MLLM as the retriever to assess trajectory effectiveness. All the code for benchmark tasks, simulator modifications, and the MLLM retriever is available at https://github.com/PKU-RL/MART.
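
The preference-learning idea above can be illustrated with a minimal sketch of a Bradley-Terry pairwise objective. This is not the authors' implementation: the function names (`bt_loss`, `bt_preference_prob`, `rank_trajectories`) are hypothetical, and the scalar scores stand in for whatever the fine-tuned MLLM's scoring head would output for a (task, trajectory) input.

```python
import math


def bt_loss(score_pos: float, score_neg: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(score_pos - score_neg)).

    Computed as a numerically stable softplus of (score_neg - score_pos).
    """
    d = score_pos - score_neg
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def bt_preference_prob(score_pos: float, score_neg: float) -> float:
    """Probability that the first trajectory is preferred over the second."""
    return math.exp(-bt_loss(score_pos, score_neg))


def rank_trajectories(scores: dict[str, float]) -> list[str]:
    """Return trajectory ids sorted from highest to lowest score (hypothetical helper)."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Illustrative scores only; not results from the paper.
    print(bt_preference_prob(1.2, -0.3))
    print(rank_trajectories({"traj_a": 0.1, "traj_b": 0.9, "traj_c": 0.5}))
```

Minimizing this loss over preference pairs pushes the score of the more effective trajectory above the less effective one, so sorting by score at inference time yields the retrieval order.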

Paper Structure (42 sections, 2 equations, 11 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Embodied Agents Based on Large Models
Memory Retrieval in Agents
Interactively Learning Multimodal Retrieval
Problem Formulation
Memory
Multimodal Retriever
Trajectory Abstraction
Experiments
Experimental Setup
Environments
Task Settings
Memory Construction
Task Evaluation
...and 27 more sections

Figures (11)

Figure 1: Similarity-Based Retriever vs. MART. Traditional multimodal retrieval methods (1) depend on calculating weighted sums of image and text embedding similarities, while our approach (2) introduces interactive learning to assess the relevance between the current and expert trajectories.
Figure 2: Scatter plots illustrating the relationship between success rate and embedding similarity (left) or effectiveness score (right) in two environments. The red line indicates a linear fit to the data.
Figure 3: Overview of MART. Our approach interactively learns a multimodal retriever that scores expert trajectories and retrieves the most effective trajectory to guide an agent in novel situations. By treating trajectories with higher success rates as positive samples and those with lower success rates as negative samples, we obtain preference pairs, which are used to fine-tune an MLLM retriever to score trajectory effectiveness for a specific task.
Figure 4: Environment comparison.
Figure 5: Comparison between similarity-based retriever and MART.
...and 6 more figures
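
The contrast drawn in the captions of Figures 1 and 3 can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: `similarity_score` shows a weighted sum of image and text embedding similarities (the weight `alpha` and the use of cosine similarity are assumptions), and `build_preference_pairs` shows one way to turn per-trajectory success rates into (positive, negative) pairs (the `margin` parameter is hypothetical).

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_score(query_img, query_txt, traj_img, traj_txt, alpha: float = 0.5) -> float:
    """Similarity-based baseline: weighted sum of image and text similarities.

    `alpha` is an assumed mixing weight, not a value taken from the paper.
    """
    return alpha * cosine(query_img, traj_img) + (1.0 - alpha) * cosine(query_txt, traj_txt)


def build_preference_pairs(success_rates: dict[str, float], margin: float = 0.0):
    """Build (positive, negative) pairs: the higher success rate is the positive sample.

    Pairs whose success rates differ by no more than `margin` are skipped.
    """
    pairs = []
    for a, b in combinations(success_rates, 2):
        diff = success_rates[a] - success_rates[b]
        if diff > margin:
            pairs.append((a, b))
        elif -diff > margin:
            pairs.append((b, a))
    return pairs


if __name__ == "__main__":
    # Illustrative success rates only; not data from the paper.
    print(build_preference_pairs({"t1": 0.8, "t2": 0.2, "t3": 0.8}))
```

The baseline scores a trajectory purely by how much it resembles the current situation, whereas the preference pairs encode which trajectory actually led to higher success when used as guidance, which is the signal the retriever is fine-tuned on.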
