Event-Driven Proactive Assistive Manipulation with Grounded Vision-Language Planning

Fengkai Liu, Hao Su, Haozhuang Chi, Rui Geng, Congzhi Ren, Xuqing Liu, Yucheng Xu, Yuichi Ohsita, Liyun Zhang

Abstract

Assistance in collaborative manipulation is often initiated by user instructions, making high-level reasoning request-driven. In fluent human teamwork, however, partners often infer the next helpful step from the observed outcome of an action rather than waiting for instructions. Motivated by this, we introduce a shift from request-driven assistance to event-driven proactive assistance, where robot actions are initiated by workspace state transitions induced by human-object interactions rather than user-provided task instructions. To this end, we propose an event-driven framework that tracks interaction progress with an event monitor and, upon event completion, extracts stabilized pre/post snapshots that characterize the resulting state transition. Given the stabilized snapshots, the planner analyzes the implied state transition to infer a task-level goal and decide whether to intervene; if so, it generates a sequence of assistive actions. To make outputs executable and verifiable, we restrict actions to a set of action primitives and reference objects via integer IDs. We evaluate the framework on a real tabletop number-block collaboration task, demonstrating that explicit pre/post state-change evidence improves proactive completion on solvable scenes and appropriate waiting on unsolvable ones.
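The event-monitoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scene representation (object ID mapped to a 2D position), the `moved` test, the tolerance, and the `stable_frames` threshold are all assumptions chosen for illustration. The idea shown is only that a pre/post pair is emitted once the workspace has settled again after a change.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical scene representation: integer object ID -> (x, y) position.
Scene = Dict[int, Tuple[float, float]]


def moved(a: Scene, b: Scene, tol: float = 0.01) -> bool:
    """True if any object appeared, vanished, or shifted by more than tol."""
    if a.keys() != b.keys():
        return True
    return any(
        abs(a[k][0] - b[k][0]) > tol or abs(a[k][1] - b[k][1]) > tol
        for k in a
    )


def monitor(
    frames: Iterable[Scene], stable_frames: int = 3
) -> Iterator[Tuple[Scene, Scene]]:
    """Yield (pre, post) snapshots each time the scene settles in a new state."""
    pre: Optional[Scene] = None   # last stabilized snapshot
    prev: Optional[Scene] = None  # previous frame, for motion detection
    still = 0                     # consecutive frames without motion
    for frame in frames:
        if prev is not None and not moved(prev, frame):
            still += 1
        else:
            still = 0
        prev = frame
        if still >= stable_frames:
            if pre is None:
                pre = frame           # first stable view of the workspace
            elif moved(pre, frame):
                yield pre, frame      # event completed: state transition found
                pre = frame


# Usage: a human places block 3 next to blocks 1 and 2.
before = {1: (0.0, 0.0), 2: (0.1, 0.0)}
in_motion = {1: (0.0, 0.0), 2: (0.1, 0.0), 3: (0.15, 0.05)}
after = {1: (0.0, 0.0), 2: (0.1, 0.0), 3: (0.2, 0.0)}
events = list(monitor([before] * 4 + [in_motion] + [after] * 4))
```

Waiting for several still frames before emitting a snapshot is one simple way to avoid planning on frames captured while a hand is still in the workspace; the actual stabilization criterion used in the paper is not stated in the abstract.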

Paper Structure

This paper contains 27 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Motivation: request-driven vs. event-driven assistance. Request-driven systems initiate planning from requests provided by the user [kim2024openvla, chen2025intention]. Our event-driven setting instead triggers on human-object state transitions to infer the user's goal and act without an additional user request.
  • Figure 2: System overview. The arm-side local stream monitors events, constructs an event payload, and executes grounded actions with verification; a cloud VLM performs event-level planning and returns an ID-indexed symbolic plan.
  • Figure 3: Qualitative example of event-driven proactive completion in the tabletop number-block task. The top row shows runtime snapshots (pre-event, human interaction, robot execution, and the final state). The middle row shows the system prompt. The bottom row shows the planner's event interpretation and the resulting structured plan, completing $2{+}3{=}5$ by placing "5" to the right of "=".
  • Figure 4: Quantitative results: outcome proportions. Each bar shows the fraction of successes and failures.
  • Figure 5: Failure taxonomy for solvable and unsolvable cases. Pick: unintended manipulation of a block that should remain fixed. Result: incorrect arithmetic completion. Ambiguity: insufficient evidence for a confident next assistive move. Place: final placement fails the desired spatial relation. Identification: perception-level misrecognition of symbols.
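
The ID-indexed symbolic plan mentioned in the Figure 2 and Figure 3 captions can be checked before execution. The following is a minimal sketch under stated assumptions: the primitive names (`pick`, `place_right_of`, `wait`), the `Step` fields, and the block IDs are hypothetical, since the captions do not list the paper's actual primitive set or plan schema. It shows only why restricting a planner to a fixed vocabulary and integer IDs makes its output mechanically verifiable.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Hypothetical primitive vocabulary; the paper's actual set is not given here.
ALLOWED_PRIMITIVES = {"pick", "place_right_of", "wait"}


@dataclass(frozen=True)
class Step:
    primitive: str
    object_id: Optional[int] = None     # block to manipulate
    reference_id: Optional[int] = None  # anchor block for relative placement


def validate_plan(steps: Iterable[Step], known_ids: Set[int]) -> List[str]:
    """Return error messages; an empty list means every step is well-formed."""
    errors = []
    for i, step in enumerate(steps):
        if step.primitive not in ALLOWED_PRIMITIVES:
            errors.append(f"step {i}: unknown primitive {step.primitive!r}")
        for name in ("object_id", "reference_id"):
            value = getattr(step, name)
            if value is not None and value not in known_ids:
                errors.append(f"step {i}: {name}={value} is not in the scene")
    return errors


# Usage: complete 2+3=5 by placing block "5" (assumed ID 7) right of "=" (assumed ID 4).
scene_ids = {1, 2, 3, 4, 7}
plan = [
    Step("pick", object_id=7),
    Step("place_right_of", object_id=7, reference_id=4),
]
plan_errors = validate_plan(plan, scene_ids)

# A plan with an unsupported primitive and an ID absent from the scene is rejected.
bad_errors = validate_plan([Step("throw", object_id=9)], scene_ids)
```

A check of this kind catches malformed planner output (unknown actions or references to objects that are not in the scene) before any motion is attempted; it does not verify that the plan is semantically correct, which corresponds to the Result and Place failure categories in Figure 5.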