BUMBLE: Unifying Reasoning and Acting with Vision-Language Models for Building-wide Mobile Manipulation

Rutav Shah; Albert Yu; Yifeng Zhu; Yuke Zhu; Roberto Martín-Martín

TL;DR

BUMBLE is a unified Vision-Language Model (VLM)-based framework that integrates open-world RGB-D perception, a wide spectrum of gross-to-fine motor skills, and dual-layered memory. Evaluation indicates that it outperforms competitive baselines in long-horizon, building-wide tasks that require sequencing up to 12 skills.

Abstract

To operate at a building scale, service robots must perform very long-horizon mobile manipulation tasks by navigating to different rooms, accessing different floors, and interacting with a wide and unseen range of everyday objects. We refer to these tasks as Building-wide Mobile Manipulation. To tackle these inherently long-horizon tasks, we introduce BUMBLE, a unified Vision-Language Model (VLM)-based framework integrating open-world RGB-D perception, a wide spectrum of gross-to-fine motor skills, and dual-layered memory. Our extensive evaluation (90+ hours) indicates that BUMBLE outperforms multiple baselines in long-horizon building-wide tasks that require sequencing up to 12 ground truth skills spanning 15 minutes per trial. BUMBLE achieves a 47.1% success rate averaged over 70 trials in different buildings, tasks, and scene layouts from different starting rooms and floors. Our user study demonstrates 22% higher satisfaction with our method than with state-of-the-art mobile manipulation methods. Finally, we demonstrate the potential of using increasingly capable foundation models to push performance further. For more information, see https://robin-lab.cs.utexas.edu/BUMBLE/

Paper Structure (22 sections, 7 figures, 2 tables)

Introduction
Related Work
BUMBLE: VLM-Based Building-Wide MoMa
Perception System, Skill Library, and Memory
Open-world Perception System
Skill Library
Memory
VLM-based Decision Making
Subtask prediction and skill selection
Skill parameter estimation
Experimental Evaluation
Conclusion
Acknowledgments
Appendix
Method details
...and 7 more sections

Figures (7)

Figure 1: Building-wide mobile manipulation. At a building-wide scale, mobile manipulation tasks involve sequencing multiple skills like pushing obstacles or opening doors to clear robot pathways, using elevators to reach a destination floor, rearranging chairs in the workspace, and retrieving objects. We present a framework, BUMBLE, that can solve long-horizon, spatially expansive tasks across different buildings.
Figure 2: Building-wide Mobile Manipulation tasks require navigating various rooms and floors while interacting with diverse objects. To solve such long-horizon tasks, BUMBLE leverages a diverse skill library enabling navigation and interaction, with parameterized skills adaptable to different scenes. At each skill execution step (circled numbers), BUMBLE uses a VLM's reasoning capabilities to select the next skill (blue text) and skill parameters ([…]) and to recover from failures (red text), solving building-wide tasks effectively.
Figure 3: BUMBLE Architecture. Given a free-form text instruction, a skill library (top left), and short- and long-term memory (bottom middle and left), BUMBLE iteratively perceives the environment through onboard RGB-D sensors (top left), predicts parameterized skills (top middle and right), and executes them in the environment. The predicted skill and its parameters are executed and stored in short-term memory for iterative prediction (bottom right).
Figure 4: Execution trace for a user instruction. For each decision step in the execution trace, we show the image observation with grounded skill parameters as markers, the skill name (blue text) and parameter executed, and a brief language description of the step (black) to improve readability. In trying to execute some skills, the robot fails (red text), in which case the VLM adaptively predicts the next skill.
Figure 5: Success rate (%) of VLMs in predicting skill parameters. Within each model series, the models are arranged from left to right in order of increasing capability, as measured by the Vision Arena [chiang2024chatbot].
...and 2 more figures
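
The iterative loop described in the Figure 3 caption (perceive, predict a parameterized skill, execute it, store the step in short-term memory) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the names `Memory`, `Step`, `run_episode`, and `toy_policy`, the skill names, the `"done"` stop signal, and the step cap are all hypothetical, and a scripted policy stands in for the VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    skill: str      # name of the executed skill
    params: dict    # parameters the policy chose for it
    success: bool   # whether execution succeeded


@dataclass
class Memory:
    # Hypothetical layout: the caption only says that both memories exist.
    short_term: List[Step] = field(default_factory=list)  # steps of the current trial
    long_term: List[str] = field(default_factory=list)    # notes kept across trials


def run_episode(instruction: str,
                skills: Dict[str, Callable[[dict], bool]],
                policy: Callable[[str, dict, Memory, List[str]], Tuple[str, dict]],
                perceive: Callable[[], dict],
                memory: Memory,
                max_steps: int = 20) -> List[Step]:
    """Perceive, predict a parameterized skill, execute it, and record the step."""
    for _ in range(max_steps):  # arbitrary cap so the sketch always terminates
        observation = perceive()
        name, params = policy(instruction, observation, memory, sorted(skills))
        if name == "done":  # assumed stop signal from the policy
            break
        success = skills[name](params)
        memory.short_term.append(Step(name, params, success))
    return memory.short_term


# Toy stand-ins so the sketch runs without a robot or a VLM.
attempts = {"open_door": 0}


def open_door(params: dict) -> bool:
    attempts["open_door"] += 1
    return attempts["open_door"] > 1  # fails on the first try, then succeeds


def pick_up(params: dict) -> bool:
    return True


def toy_policy(instruction, observation, memory, skill_names):
    """Scripted replacement for the VLM: retry a failed skill, else follow a plan."""
    if memory.short_term and not memory.short_term[-1].success:
        last = memory.short_term[-1]
        return last.skill, last.params
    plan = [("open_door", {"side": "left"}), ("pick_up", {"object": "can"})]
    completed = sum(1 for step in memory.short_term if step.success)
    return plan[completed] if completed < len(plan) else ("done", {})


trace = run_episode(
    instruction="bring me a can",
    skills={"open_door": open_door, "pick_up": pick_up},
    policy=toy_policy,
    perceive=lambda: {"rgb": None, "depth": None},  # placeholder observation
    memory=Memory(),
)
for step in trace:
    print(step.skill, step.params, "ok" if step.success else "failed")
```

In this toy run the first `open_door` attempt fails, the scripted policy sees the failure in short-term memory and retries, and the episode then proceeds to `pick_up`. In the framework described above, that choice would come from the VLM's reasoning rather than from a fixed retry rule.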
