SELU: Self-Learning Embodied MLLMs in Unknown Environments

Boyu Li, Haobin Jiang, Ziluo Ding, Xinrun Xu, Haoran Li, Dongbin Zhao, Zongqing Lu

TL;DR

SELU tackles the problem of autonomous self-learning for multimodal LLMs operating in unknown embodied environments without external rewards. It introduces an actor-critic framework where the critic uses self-asking and hindsight relabeling to improve environmental grounding, and the actor is refined via critic-guided feedback; both components are updated with LoRA-based fine-tuning. The method is validated in AI2-THOR and VirtualHome, showing approximately 28–30% improvements in critic tasks and 20–24% improvements in actor decision-making across tasks, using two different MLLMs. This work demonstrates the feasibility of self-contained self-learning for embodied MLLMs and highlights the potential for stronger environment grounding, while noting limitations in trajectory-level evaluation and pointing to future work on finer critic signals and longer-horizon tasks.

Abstract

Recently, multimodal large language models (MLLMs) have demonstrated strong visual understanding and decision-making capabilities, enabling the exploration of autonomously improving MLLMs in unknown environments. However, external feedback, such as human or environmental feedback, is not always available. To address this challenge, existing methods primarily focus on enhancing the decision-making capabilities of MLLMs through voting and scoring mechanisms, while little effort has been devoted to improving the environmental comprehension of MLLMs in unknown environments. To fully unleash the self-learning potential of MLLMs, we propose a novel self-learning paradigm, dubbed SELU, inspired by the actor-critic paradigm in reinforcement learning. The critic employs self-asking and hindsight relabeling to extract knowledge from interaction trajectories collected by the actor, thereby augmenting its environmental comprehension. Simultaneously, the actor is improved by the self-feedback provided by the critic, enhancing its decision-making. We evaluate our method in the AI2-THOR and VirtualHome environments, and SELU achieves critic improvements of approximately 28% and 30%, and actor improvements of about 20% and 24% via self-learning.
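
The data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Trajectory`, `CriticExample`, `hindsight_relabel`, and `self_learning_round` are hypothetical, the `outcome` field stands in for what the critic would infer through self-asking, and the rule that the actor is trained only on trajectories the critic judges successful is an assumption about how critic feedback is used. The fine-tuning step itself is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    instruction: str      # task the actor was asked to perform
    actions: List[str]    # actions the actor executed
    outcome: str          # hypothetical: what the critic infers was achieved (via self-asking)


@dataclass
class CriticExample:
    instruction: str
    actions: List[str]
    label: bool           # True if the actions complete `instruction`


def hindsight_relabel(traj: Trajectory, judged_success: bool) -> List[CriticExample]:
    """Turn one trajectory into critic training examples.

    A trajectory judged as a failure for its original instruction is also
    stored as a success for the task it actually achieved (assumed rule).
    """
    examples = [CriticExample(traj.instruction, traj.actions, judged_success)]
    if not judged_success and traj.outcome:
        examples.append(CriticExample(traj.outcome, traj.actions, True))
    return examples


def self_learning_round(
    trajectories: List[Trajectory],
    critic: Callable[[Trajectory], bool],
) -> Tuple[List[CriticExample], List[Tuple[str, List[str]]]]:
    """Build critic and actor datasets without any external reward."""
    critic_data: List[CriticExample] = []
    actor_data: List[Tuple[str, List[str]]] = []
    for traj in trajectories:
        ok = critic(traj)                      # self-feedback from the critic
        critic_data.extend(hindsight_relabel(traj, ok))
        if ok:                                 # assumption: actor imitates judged successes
            actor_data.append((traj.instruction, traj.actions))
    return critic_data, actor_data


def toy_critic(traj: Trajectory) -> bool:
    # Rule-based stand-in for the critic MLLM, for illustration only.
    return traj.instruction == traj.outcome


demo = [
    Trajectory("pick up the lettuce", ["move_ahead", "pickup"], "pick up the lettuce"),
    Trajectory("pick up the lettuce", ["pickup"], "pick up the tomato"),
]
critic_data, actor_data = self_learning_round(demo, toy_critic)
print(len(critic_data), len(actor_data))
```

In this toy run, the failed second trajectory contributes two critic examples (a negative for the lettuce task and a relabeled positive for the tomato task), while only the first trajectory reaches the actor dataset.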

Paper Structure

This paper contains 25 sections, 4 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of our framework with other frameworks in terms of the feedback type.
  • Figure 2: The framework of SELU. (lower) The actor MLLM, represented as a robot, collects trajectories for the given instructions. (upper) The critic MLLM, denoted as a brain, evaluates these trajectories and determines whether they complete the tasks, guiding the update of the actor MLLM. In addition, the critic MLLM implements self-asking and hindsight relabeling to build a dataset for optimizing itself. The whole framework does not require any external feedback, such as environmental rewards or human annotations.
  • Figure 3: Hyperparameter study of SELU on pick-up tasks in the AI2-THOR environment: (a) explores the size of the interaction dataset required for embodied MLLMs, (b) illustrates why a single MLLM is not suitable for SELU from the perspective of learning rate, and (c) demonstrates the effect of multiple training iterations.
  • Figure 4: The diagram of experimental environments. We utilize the first-person perspective for decision-making and a third-person perspective for trajectory evaluation.
  • Figure 5: A visualization of the actor MLLM interacting with the AI2-THOR environment. The agent is instructed to pick up the lettuce. As the lettuce is far away, the agent needs to move closer before attempting to pick it up.
  • ...and 1 more figure