BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands

Seongwon Cho, Daechul Ahn, Donghyun Shin, Hyeonbeom Choi, San Kim, Jonghyun Choi

TL;DR

Open-vocabulary mobile manipulation requires robust reasoning and perception under dynamic changes. The paper proposes BINDER, a dual-process framework combining an Instant Response Module (Video-LLM) for continuous monitoring and a Deliberative Response Module (multimodal LLM) for planning with updated 3D memory, linked by bidirectional coordination. This design mitigates temporal blindness while preserving geometric precision, delivering higher task success and path efficiency across three real-world environments compared with state-of-the-art baselines. Ablation and tabletop studies validate the complementary roles of DRM and IRM and demonstrate practical improvements for real-world OVMM deployment.

Abstract

Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete update points such as navigation targets, waypoints, or the end of an action step, leaving robots blind between updates and causing cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a VideoLLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success and efficiency than SoTA baselines, demonstrating its effectiveness for real-world deployment.
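
The coordination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `run_episode`, `Report`, `demo_plan`, and `demo_monitor` are hypothetical names, the planner and monitor are plain stub functions standing in for the DRM and IRM, and the loop is sequential, whereas the paper describes the two modules as running in parallel. The three mode names follow the Figure 3 caption (continue/adjust/replan).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Iterable, List, Tuple


class Mode(Enum):
    """Execution modes named in the Figure 3 caption."""
    CONTINUE = "continue"
    ADJUST = "adjust"
    REPLAN = "replan"


@dataclass
class Report:
    """Hypothetical monitor output: a mode plus object updates for memory."""
    mode: Mode
    object_updates: Dict[str, Tuple[float, float]] = field(default_factory=dict)


Memory = Dict[str, Tuple[float, float]]
Plan = Tuple[str, str]  # (high-level action, guidance for the monitor)


def run_episode(
    plan_fn: Callable[[Memory], Plan],
    monitor_fn: Callable[[str, str], Report],
    frames: Iterable[str],
    max_replans: int = 5,
) -> Tuple[Memory, List[Tuple[str, str]]]:
    """Sequential stand-in for the dual-process loop (assumption: one report per frame)."""
    memory: Memory = {}
    log: List[Tuple[str, str]] = []
    action, guidance = plan_fn(memory)  # planner issues an action and guidance
    replans = 0
    for frame in frames:
        report = monitor_fn(frame, guidance)  # monitor inspects the current frame
        memory.update(report.object_updates)  # monitor-driven memory update
        if report.mode is Mode.REPLAN and replans < max_replans:
            action, guidance = plan_fn(memory)  # planner replans with updated memory
            replans += 1
        # For ADJUST, a real system would apply a local correction here.
        log.append((report.mode.value, action))
    return memory, log


def demo_plan(memory: Memory) -> Plan:
    """Stub planner: explore until the target is in memory, then pick it."""
    if "banana" in memory:
        return "pick", "watch the grasp"
    return "explore", "look for a banana"


def demo_monitor(frame: str, guidance: str) -> Report:
    """Stub monitor: frames are labels rather than images in this sketch."""
    if frame == "banana_visible":
        return Report(Mode.REPLAN, {"banana": (1.0, 2.0)})
    if frame == "grasp_slip":
        return Report(Mode.ADJUST)
    return Report(Mode.CONTINUE)
```

Calling `run_episode(demo_plan, demo_monitor, ["hallway", "banana_visible", "grasp_slip"])` walks through one opportunistic detection followed by one grasp correction. The `max_replans` cap is an added safeguard for the sketch, not something the source specifies.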

Paper Structure

This paper contains 16 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Limitations of existing OVMM approaches and our proposed BINDER. Robots are searching for a banana while exploring an unknown environment from navigation target $p_{0}$ to $p_{1}$. (a) Sparse-update approaches refresh perception only at navigation targets, leading to intermittent scene perception that leaves robots blind during traversal and causes them to miss objects that appear en route. (b) Methods that perform more frequent updates at intermediate waypoints partially reduce this temporal blindness but require repeated vision-processing pauses for 3D reconstruction, introducing inefficiency and still leaving blind spots between update intervals. (c) BINDER instead maintains continuous visual awareness en route via video-based monitoring and triggers 3D updates only when needed, enabling opportunistic detections (such as the banana appearing along the path) and task execution without intermittent pauses.
  • Figure 2: Illustration of dual-process reasoning in BINDER. Our proposed framework, BINDER, consists of two modules operating in parallel: the Deliberative Response Module (DRM) and the Instant Response Module (IRM). Based on the task instruction (inst.) and memory, the DRM issues high-level actions (e.g., explore("black toy")) and guides the IRM's attention. In parallel, the IRM monitors the video stream in the background. When a task-relevant event occurs, such as opportunistically detecting the task-relevant object (6s) or diagnosing a grasp failure (21s), the IRM immediately generates a report, prompting the DRM to replan for navigation or adjust the grasp for manipulation. This bidirectional coordination enables both continuous responsiveness and adaptive planning, addressing the temporal blindness of prior OVMM systems.
  • Figure 3: Flowchart of dual-process execution in BINDER. (a) Pseudocode of the execution loop: the DRM issues high-level actions and task-specific guidance, while the IRM continuously monitors video and outputs execution modes (continue/adjust/replan) and object updates that drive local corrections or trigger replanning. (b) System overview: the DRM uses task instructions and memory to generate plans and guidance, while the IRM monitors environmental changes to update memory status and trigger timely replanning under dynamic conditions.
  • Figure 4: DRM-based frontier selection with top-$k$ candidate evaluation. The robot identifies top-$k$ frontier candidates $\{f_1, f_2, f_3\}$ from the exploration value map $V_i = V_i^T + V_i^S$, and obtains the corresponding camera views $I_{f_i}$ by orienting the camera toward each candidate. Given these views, the DRM evaluates $\text{DRM}(I_{f_i}, \mathcal{P}, \mathcal{M}_t)$ to determine which frontier is most promising for locating the target object; in this example, the DRM selects $f_1$ because the scene context (e.g., a refrigerator) suggests a higher likelihood of finding a banana nearby.
  • Figure 5: Hello Robot Stretch SE3 used in our experiments. Equipped with a mobile base, prismatic lift, 3-DoF wrist, and parallel-jaw gripper, the robot uses a head-mounted RealSense D435i for wide-view RGB-D observations (for exploration and 3D reconstruction) and a wrist-mounted RealSense D405 for accurate short-range depth during grasping. Low-level control and sensor streaming run on the onboard computer, while all LLM components (DRM and IRM) run on an external workstation over Wi-Fi; grasp poses generated by AnyGrasp are transformed into the robot frame and executed using Stretch’s inverse kinematics.
  • ...and 3 more figures
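
The frontier selection described in the Figure 4 caption can be sketched as a two-stage choice: rank frontiers by the summed value map $V_i = V_i^T + V_i^S$, keep the top $k$, then let a scorer pick among those. This is a minimal sketch under stated assumptions, not the authors' code: `top_k_frontiers`, `select_frontier`, and `score_view` are hypothetical names, `score_view` is a stand-in for the DRM's evaluation $\text{DRM}(I_{f_i}, \mathcal{P}, \mathcal{M}_t)$, and the two value terms are taken as given numbers because this section does not define how they are computed.

```python
from typing import Callable, List, Sequence


def top_k_frontiers(v_t: Sequence[float], v_s: Sequence[float], k: int = 3) -> List[int]:
    """Return indices of the k frontiers with the largest V_i = V_i^T + V_i^S."""
    values = [t + s for t, s in zip(v_t, v_s)]
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


def select_frontier(
    v_t: Sequence[float],
    v_s: Sequence[float],
    score_view: Callable[[int], float],
    k: int = 3,
) -> int:
    """Pick the candidate the scorer rates highest among the top-k by value.

    score_view(i) is a hypothetical stand-in for the DRM judging the camera
    view toward frontier i; here it is any function returning a number.
    """
    candidates = top_k_frontiers(v_t, v_s, k)
    return max(candidates, key=score_view)


# Illustrative numbers only (not from the paper).
v_t = [0.2, 0.5, 0.1, 0.4]
v_s = [0.3, 0.1, 0.1, 0.5]
# Pretend the view toward frontier 1 shows a refrigerator, so it scores highest.
view_scores = {0: 0.4, 1: 0.9, 2: 1.0, 3: 0.2}
chosen = select_frontier(v_t, v_s, lambda i: view_scores.get(i, 0.0))
```

With these numbers, frontier 2 has the highest view score but the lowest map value, so it never reaches the scorer. The sketch returns frontier 1, which shows how the value map prunes candidates before the more expensive evaluation runs.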