UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models
Qiaojun Yu, Siyuan Huang, Xibin Yuan, Zhengkai Jiang, Ce Hao, Xin Li, Haonan Chang, Junbo Wang, Liu Liu, Hongsheng Li, Peng Gao, Cewu Lu
TL;DR
This work tackles the generalization gap in robotic manipulation by unifying tool usage and articulated-object understanding in a single multimodal, language-guided framework. The authors introduce UniAff, which formulates a structured 3D representation of object parts comprising 6D pose, rotated 2D bounding boxes, and affordance regions. The model is trained on a synthetic, richly labeled dataset of 1,500 objects spanning 19 articulated-object categories and 12 tool categories. By fine-tuning an SPHINX/LLaMA2-based MLLM with VQA-style prompts and leveraging high-resolution visual grounding, UniAff achieves significant improvements over state-of-the-art baselines on both tool and articulated-object tasks, in simulation and in real-world experiments. The results demonstrate strong cross-domain generalization and position UniAff as a general baseline for unified robotic manipulation using vision-language reasoning and structured 3D representations.
Abstract
Previous studies on robotic manipulation are based on a limited understanding of the underlying 3D motion constraints and affordances. To address this limitation, we propose a comprehensive paradigm, termed UniAff, that integrates 3D object-centric manipulation and task understanding in a unified formulation. Specifically, we construct a dataset labeled with manipulation-related key attributes, comprising 900 articulated objects from 19 categories and 600 tools from 12 categories. Furthermore, we leverage MLLMs to infer object-centric representations for manipulation tasks, including affordance recognition and reasoning about 3D motion constraints. Comprehensive experiments in both simulation and real-world settings indicate that UniAff significantly improves the generalization of robotic manipulation for tools and articulated objects. We hope that UniAff will serve as a general baseline for unified robotic manipulation tasks in the future. Images, videos, dataset, and code are published on the project website at: https://sites.google.com/view/uni-aff/home
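
The TL;DR describes a structured part representation built from a 6D pose, a rotated 2D bounding box, and an affordance region. To make that idea concrete, the sketch below shows one way such a record could be held in code and how a rotated box expands to its four image-space corners. It is an illustration only and not the authors' data format: the class names `RotatedBox2D` and `PartAnnotation`, all field names, the roll/pitch/yaw parameterization, and the counter-clockwise angle convention are assumptions made here.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RotatedBox2D:
    """Hypothetical rotated 2D box: center, size, and rotation angle."""
    cx: float
    cy: float
    width: float
    height: float
    angle_deg: float  # assumed convention: counter-clockwise, in degrees

    def corners(self) -> List[Tuple[float, float]]:
        """Return the four corners after rotating about the box center."""
        theta = math.radians(self.angle_deg)
        c, s = math.cos(theta), math.sin(theta)
        hw, hh = self.width / 2.0, self.height / 2.0
        # Corner offsets in the box's own (unrotated) frame.
        offsets = [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]
        # Apply a standard 2D rotation, then translate to the center.
        return [
            (self.cx + dx * c - dy * s, self.cy + dx * s + dy * c)
            for dx, dy in offsets
        ]


@dataclass
class PartAnnotation:
    """Hypothetical per-part record combining the three attributes."""
    name: str
    box: RotatedBox2D
    position_xyz: Tuple[float, float, float]  # translation part of the 6D pose
    rotation_rpy: Tuple[float, float, float]  # orientation part (assumed roll/pitch/yaw)
    affordance_box: RotatedBox2D              # region where the part can be grasped


handle = PartAnnotation(
    name="drawer_handle",
    box=RotatedBox2D(cx=100.0, cy=50.0, width=40.0, height=10.0, angle_deg=0.0),
    position_xyz=(0.4, 0.0, 0.3),
    rotation_rpy=(0.0, 0.0, 0.0),
    affordance_box=RotatedBox2D(cx=100.0, cy=50.0, width=20.0, height=10.0, angle_deg=0.0),
)
handle_corners = handle.box.corners()
```

Storing the box as center, size, and angle keeps the record compact and lets a consumer derive corner coordinates on demand, at the cost of having to agree on an angle convention up front.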
