UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models

Qiaojun Yu; Siyuan Huang; Xibin Yuan; Zhengkai Jiang; Ce Hao; Xin Li; Haonan Chang; Junbo Wang; Liu Liu; Hongsheng Li; Peng Gao; Cewu Lu

TL;DR

This work tackles the generalization gap in robotic manipulation by unifying tool usage and articulated-object understanding in a multimodal, language-guided framework. The authors introduce UniAff, which formulates a structured 3D representation of parts with 6D pose, rotated 2D bounding boxes, and affordance regions, and trains it on a synthetic, richly labeled dataset spanning 1,500 objects across 19 articulated categories and 12 tool categories. By fine-tuning an SPHINX/LLaMA2-based MLLM with VQA-style prompts and leveraging high-resolution visual grounding, UniAff achieves significant improvements over state-of-the-art baselines on both tool and articulated-object tasks, in simulation and in real-world experiments. The results demonstrate strong cross-domain generalization and establish UniAff as a general baseline for unified robotic manipulation tasks using vision-language reasoning and structured 3D representations.
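To make the structured representation described above more concrete, here is a minimal Python sketch of how one part annotation could be held in memory. It is an illustration under stated assumptions, not the authors' data format: the class names `RotatedBox` and `PartAnnotation`, every field name, the center/size/angle box parameterization, and the roll-pitch-yaw encoding of the 6D pose are all hypothetical. Only the kinds of attributes (part box, grasp and functional affordance, pose, manipulation type) come from the text.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RotatedBox:
    """Rotated 2D box in image pixels (hypothetical center/size/angle layout)."""
    cx: float
    cy: float
    w: float
    h: float
    angle_deg: float  # rotation about the box center, in degrees

    def corners(self) -> List[Tuple[float, float]]:
        """Return the four corner points after rotating about the center."""
        a = math.radians(self.angle_deg)
        c, s = math.cos(a), math.sin(a)
        half = [(-self.w / 2, -self.h / 2), (self.w / 2, -self.h / 2),
                (self.w / 2, self.h / 2), (-self.w / 2, self.h / 2)]
        return [(self.cx + x * c - y * s, self.cy + x * s + y * c)
                for x, y in half]


@dataclass
class PartAnnotation:
    """One object part with the attribute kinds named in the summary.

    All field names are assumptions made for this sketch.
    """
    name: str
    part_box: RotatedBox
    grasp_affordance: RotatedBox
    functional_affordance: Optional[RotatedBox]  # tools only, in this sketch
    manipulation_type: str                       # e.g. "revolute part"
    position_xyz: Tuple[float, float, float]     # translation part of the pose
    rotation_rpy: Tuple[float, float, float]     # rotation part of the pose


# Hypothetical example values, not taken from the dataset.
handle = PartAnnotation(
    name="drawer handle",
    part_box=RotatedBox(10.0, 20.0, 4.0, 2.0, 0.0),
    grasp_affordance=RotatedBox(10.0, 20.0, 2.0, 1.0, 0.0),
    functional_affordance=None,
    manipulation_type="prismatic part",
    position_xyz=(0.4, 0.0, 0.3),
    rotation_rpy=(0.0, 0.0, 0.0),
)
print(handle.part_box.corners())
```

Keeping the boxes as center, size, and angle makes it easy to serialize them into a short text answer for a VQA-style prompt; the serialization the paper actually uses is not specified in this summary.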

Abstract

Previous studies on robotic manipulation are based on a limited understanding of the underlying 3D motion constraints and affordances. To address these challenges, we propose a comprehensive paradigm, termed UniAff, that integrates 3D object-centric manipulation and task understanding in a unified formulation. Specifically, we constructed a dataset labeled with manipulation-related key attributes, comprising 900 articulated objects from 19 categories and 600 tools from 12 categories. Furthermore, we leverage MLLMs to infer object-centric representations for manipulation tasks, including affordance recognition and reasoning about 3D motion constraints. Comprehensive experiments in both simulation and real-world settings indicate that UniAff significantly improves the generalization of robotic manipulation for tools and articulated objects. We hope that UniAff will serve as a general baseline for unified robotic manipulation tasks in the future. Images, videos, dataset, and code are published on the project website at: https://sites.google.com/view/uni-aff/home

Paper Structure (17 sections, 5 figures, 4 tables)

Introduction
Related Work
Method
Formulation of Structured Manipulation Task
Synthetic Data Generation
Tools
Articulated Objects
VQA Design
MLLMs-based Manipulation and Model Fine-tuning
Experiments
Experimental Settings
Robotic Affordance Detection Result
Tool Usage Understanding Evaluation
Articulation Manipulation Evaluation
Ablation Studies
...and 2 more sections

Figures (5)

Figure 1: UniAff unifies tool usage and articulation understanding in a VQA format, predicting part bounding boxes, 6D poses, grasp affordances, functional affordances, manipulation types, and related attributes for robotic manipulation tasks.
Figure 2: The architecture of UniAff. Image features are first extracted using a Mixed Visual Encoder, such as DINOv2, CLIP, or Q-Former, followed by an MLP projector. Next, features are extracted from the language instructions with the Llama Tokenizer. Finally, the outputs of the structured manipulation tasks, such as Part BBOX, Affordance, and Revolute Parts, are used to execute robotic instructions.
Figure 3: Illustration of tools. The blue box indicates grasp affordance, the red box indicates functional affordance and the orientation axis illustrates the object's pose.
Figure 4: Illustration of manipulation types: (a) bottle cap, (b) revolute part, (c) sliding lid, (d) prismatic part. The yellow box represents the object part, the blue box indicates grasp affordance, the red arrow marks the joint parameter, and the green arrow illustrates the manipulation trajectory.
Figure 5: Real-world experiments with UniAff, progressing from tool manipulation to articulated-object interaction. Tasks include striking a designated target with a hammer; opening a drawer, refrigerator, microwave, and pot; and lifting a bucket handle.
