ArtiBench and ArtiBrain: Benchmarking Generalizable Vision-Language Articulated Object Manipulation

Yuhan Wu, Tiantian Wei, Shuo Wang, ZhiChao Wang, Yanyong Zhang, Daniel Cremers, Yan Xia

TL;DR

This work tackles long-horizon articulated-object manipulation by introducing ArtiBench, a large-scale benchmark with cross-part, cross-instance, cross-category, and long-horizon tasks across multiple household domains. It then presents ArtiBrain, a hierarchical framework that couples a VLM-based Task Reasoner with a Hybrid Controller and an Affordance Memory Bank to achieve robust, interpretable, and transferable manipulation policies. Empirical results in simulation and real-world settings show superior part-level generalization and success on complex multi-step tasks, outperforming state-of-the-art baselines. The approach emphasizes part-level affordance transfer and closed-loop reasoning to enable reliable manipulation across unseen articulated parts and configurations.

Abstract

Interactive articulated manipulation requires long-horizon, multi-step interactions with appliances while maintaining physical consistency. Existing vision-language and diffusion-based policies struggle to generalize across parts, instances, and categories. We first introduce ArtiBench, a five-level benchmark covering kitchen, storage, office, and tool environments. ArtiBench enables structured evaluation from cross-part and cross-instance variation to long-horizon multi-object tasks, revealing the core generalization challenges of articulated object manipulation. Building on this benchmark, we propose ArtiBrain, a modular framework that unifies high-level reasoning with adaptive low-level control. ArtiBrain uses a VLM-based Task Reasoner (GPT-4.1) to decompose and validate subgoals, and employs a Hybrid Controller that combines geometry-aware keyframe execution with affordance-guided diffusion for precise and interpretable manipulation. An Affordance Memory Bank continually accumulates successful execution episodes and propagates part-level actionable affordances to unseen articulated parts and configurations. Extensive experiments on ArtiBench show that ArtiBrain significantly outperforms state-of-the-art multimodal and diffusion-based methods in robustness and generalization. Code and dataset will be released upon acceptance.
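The closed loop described in the abstract (decompose the instruction into subgoals, execute each with the matching controller branch, and validate before moving on) can be illustrated with a minimal sketch. This is not the authors' implementation: `SubTask`, `run_plan`, the `articulated` routing flag, the retry budget, and the dictionary-based world state are all hypothetical stand-ins, and the success conditions here are plain Python predicates rather than VLM judgments on camera observations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]  # toy world state; the real system reasons over observations


@dataclass
class SubTask:
    primitive: str                             # p_i, e.g. "open" or "place"
    target: str                                # o_i, e.g. "drawer"
    success_condition: Callable[[State], bool]  # c_i, checked after execution
    articulated: bool                          # hypothetical flag used for routing


def run_plan(
    plan: List[SubTask],
    controllers: Dict[str, Callable[[SubTask, State], None]],
    state: State,
    max_retries: int = 2,
) -> Tuple[bool, List[Tuple[str, str, str, bool]]]:
    """Execute sub-tasks in order; advance only once c_i holds."""
    log: List[Tuple[str, str, str, bool]] = []
    for task in plan:
        branch = "articulated" if task.articulated else "rigid"
        for _ in range(max_retries + 1):
            controllers[branch](task, state)       # low-level execution
            ok = task.success_condition(state)     # validation step
            log.append((task.primitive, task.target, branch, ok))
            if ok:
                break
        else:
            return False, log  # retry budget exhausted; stop the plan
    return True, log


# Toy controllers: the articulated branch fails on its first attempt.
attempts = {"articulated": 0}


def articulated_branch(task: SubTask, state: State) -> None:
    attempts["articulated"] += 1
    if attempts["articulated"] >= 2:
        state["drawer_open"] = True


def rigid_branch(task: SubTask, state: State) -> None:
    if state["drawer_open"]:
        state["cup_in_drawer"] = True


world: State = {"drawer_open": False, "cup_in_drawer": False}
plan = [
    SubTask("open", "drawer", lambda s: s["drawer_open"], articulated=True),
    SubTask("place", "cup", lambda s: s["cup_in_drawer"], articulated=False),
]
success, log = run_plan(
    plan, {"articulated": articulated_branch, "rigid": rigid_branch}, world
)
```

In this toy run the first "open" attempt fails validation and is retried, so the rigid "place" step only starts once the drawer is confirmed open. That ordering guarantee is what the validate-before-progressing loop provides.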

Paper Structure

This paper contains 14 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of ArtiBrain and ArtiBench. (Left) ArtiBrain performs long-horizon articulated manipulation via hierarchical reasoning and hybrid control. It is a hierarchical, closed-loop framework that integrates three key modules: a VLM-based Task Reasoner, a Hybrid Controller for both rigid and articulated actions, and an Affordance Memory Bank that accumulates verified part-level affordances to enhance transfer across parts and categories. (Right) ArtiBench provides 100+ articulated tasks and 400+ variations across four household scenarios and five generalization levels, enabling systematic evaluation from part-level variation to long-horizon multi-object manipulation.
  • Figure 2: Representative tasks from the four ArtiBench scenarios. Examples include disposing trash and organizing items in Storage, opening a refrigerator or oven in the Kitchen, manipulating drawers and laptops in the Office, and placing tools in the Tool setting. These tasks illustrate the diversity of everyday articulated interactions.
  • Figure 3: Architecture of our VLM-based Task Reasoner in ArtiBrain. Given a natural-language instruction and initial observation $I_0$, the VLM generates a structured plan of sub-tasks $(p_i, o_i)$ with corresponding success conditions $c_i$. The reasoning process ensures each action is executed and validated before progressing.
  • Figure 4: Architecture of the Hybrid Controller in ArtiBrain. The controller integrates two branches. ArtiDiffusion handles articulated object manipulation: a four-encoder architecture extracts point cloud features from $P_t$, encodes the robot state $S_t$, and processes the transferred affordance $\Phi^{\mathrm{tgt}} = (c^{\mathrm{tgt}}_{\mathrm{3D}}, \tau^{\mathrm{tgt}})$, which is obtained by geometrically aligning the retrieved source affordance $\Phi^{\mathrm{src}} = (c^{\mathrm{src}}_{\mathrm{3D}}, \tau^{\mathrm{src}})$. The fused features condition a diffusion policy that generates the action $\mathbf{a}_t$ through temporal U-Net denoising of the noised sequence $\mathbf{a}^k$. GeoKeyframe handles rigid objects: it selects the optimal grasp pose $\mathbf{T}^\star$ and generates the action $\mathbf{a}_t$ via geometric planning.
  • Figure 5: Results on ArtiBench. All numbers denote success rates (%), averaged over three random seeds. ArtiBrain achieves the best generalization performance across levels L1–L4.
  • ...and 1 more figure
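The captions for Figures 1 and 4 describe storing verified part-level affordances and transferring a retrieved source affordance $\Phi^{\mathrm{src}} = (c^{\mathrm{src}}_{\mathrm{3D}}, \tau^{\mathrm{src}})$ to a target part by geometric alignment. The sketch below illustrates that idea under strong assumptions that are not taken from the paper: retrieval uses cosine similarity over hand-written feature vectors, and the alignment is a rigid transform $(R, t)$ supplied by the caller rather than estimated. The names `Affordance`, `AffordanceMemoryBank`, and `transfer` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Affordance:
    contact: Vec3           # c_3D: 3D contact point on the part
    trajectory: List[Vec3]  # tau: waypoints of the post-contact motion


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class AffordanceMemoryBank:
    """Keeps affordances from successful episodes only (assumed policy)."""

    def __init__(self) -> None:
        self._entries: List[Tuple[List[float], Affordance]] = []

    def add(self, feature: Sequence[float], aff: Affordance, success: bool) -> None:
        if success:  # failed executions are never stored
            self._entries.append((list(feature), aff))

    def retrieve(self, query: Sequence[float]) -> Optional[Affordance]:
        if not self._entries:
            return None
        return max(self._entries, key=lambda e: cosine(e[0], query))[1]

    def __len__(self) -> int:
        return len(self._entries)


def apply_rigid(R: Sequence[Sequence[float]], t: Vec3, p: Vec3) -> Vec3:
    """Return R @ p + t for a 3x3 rotation R and translation t."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def transfer(src: Affordance, R: Sequence[Sequence[float]], t: Vec3) -> Affordance:
    """Map a source affordance into the target frame (Phi_src -> Phi_tgt)."""
    return Affordance(
        contact=apply_rigid(R, t, src.contact),
        trajectory=[apply_rigid(R, t, p) for p in src.trajectory],
    )


bank = AffordanceMemoryBank()
handle = Affordance((0.1, 0.0, 0.5), [(0.1, 0.0, 0.5), (0.3, 0.0, 0.5)])
knob = Affordance((0.0, 0.2, 0.3), [(0.0, 0.2, 0.3)])
bank.add([1.0, 0.0], handle, success=True)
bank.add([0.0, 1.0], knob, success=True)
bank.add([0.9, 0.1], Affordance((9.0, 9.0, 9.0), []), success=False)

src = bank.retrieve([0.8, 0.2])  # query feature of an unseen part
R_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90 degrees about z
tgt = transfer(src, R_z90, (1.0, 0.0, 0.0))
```

Here the query is closest to the stored handle, so its contact point and pull trajectory are rotated and shifted into the target frame. In the caption, the resulting $\Phi^{\mathrm{tgt}}$ is then used to condition the diffusion policy, which this sketch does not model.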