BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands

Seongwon Cho; Daechul Ahn; Donghyun Shin; Hyeonbeom Choi; San Kim; Jonghyun Choi

TL;DR

Open-vocabulary mobile manipulation (OVMM) requires robust reasoning and perception under dynamic changes. The paper proposes BINDER, a dual-process framework that combines an Instant Response Module (IRM, a VideoLLM) for continuous monitoring with a Deliberative Response Module (DRM, a multimodal LLM) for planning over an updated 3D memory; the two modules are linked by bidirectional coordination. This design mitigates temporal blindness between updates while preserving geometric precision, and yields higher task success and path efficiency than state-of-the-art baselines across three real-world environments. Ablation and tabletop studies validate the complementary roles of the DRM and IRM and demonstrate practical improvements for real-world OVMM deployment.

Abstract

Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete points such as navigation targets, waypoints, or the end of an action step. This leaves robots blind between updates and causes cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a VideoLLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success and efficiency than state-of-the-art baselines, demonstrating its effectiveness for real-world deployment.
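
The bidirectional coordination described in the abstract can be illustrated with a toy control loop. This is only a minimal sketch under strong assumptions, not the authors' implementation: `drm_plan`, `irm_monitor`, `Memory`, `Signal`, and `run` are hypothetical names, the LLM and VideoLLM are replaced by simple stub functions, a "frame" is assumed to be a dictionary of detected objects, and the 3D scene memory is reduced to a name-to-location map.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Signal(Enum):
    CONTINUE = auto()  # nothing new observed in this frame
    UPDATE = auto()    # memory changed, but the plan is unaffected
    REPLAN = auto()    # a change touches something the planner cares about


@dataclass
class Memory:
    # Hypothetical stand-in for the 3D scene memory: object name -> location.
    objects: dict = field(default_factory=dict)


def drm_plan(goal, memory):
    """Stub deliberative planner: returns (steps, cues the monitor should watch)."""
    location = memory.objects.get(goal)
    if location is None:
        return ["explore"], {goal}
    return [f"navigate:{location}", f"pick:{goal}"], {goal}


def irm_monitor(frame, cues, memory):
    """Stub instant monitor: writes new detections to memory and emits a signal."""
    changed = False
    replan = False
    for name, location in frame.items():
        if memory.objects.get(name) != location:
            memory.objects[name] = location
            changed = True
            replan = replan or name in cues
    if replan:
        return Signal.REPLAN
    return Signal.UPDATE if changed else Signal.CONTINUE


def run(goal, frames):
    """Plan once, then monitor every frame; replan only when the monitor asks."""
    memory = Memory()
    steps, cues = drm_plan(goal, memory)   # planner tells the monitor what to watch
    plans = [steps]
    for frame in frames:
        if irm_monitor(frame, cues, memory) is Signal.REPLAN:
            steps, cues = drm_plan(goal, memory)  # monitor triggers the planner
            plans.append(steps)
    return plans, memory


if __name__ == "__main__":
    frames = [{"chair": "hall"}, {"mug": "table"}, {"mug": "table"}]
    plans, memory = run("mug", frames)
    print(plans)
    print(memory.objects)
```

In this sketch the planner is called only twice for three frames: the chair sighting updates memory without replanning, while the first mug sighting matches a cue and triggers a new plan. That mirrors, in a very reduced form, the trade-off between maintaining awareness and avoiding costly updates.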
