PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs

Zixin Zhang, Kanghao Chen, Xingwang Lin, Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Litao Guo, Yinchuan Li, Ying-Cong Chen

TL;DR

PhysToolBench introduces a VQA benchmark with over 1,000 image-text pairs to quantify how well Multimodal Large Language Models (MLLMs) understand physical tools across Easy, Medium, and Hard levels. It shows that current models, from proprietary systems to the backbones used in Vision-Language-Action systems, fall short of human tool understanding, especially in judging tool availability and in visual reasoning, and that embodied fine-tuning yields only limited gains. The authors analyze failures, report scale-related emergent abilities at around ten billion parameters, and propose a Vision-Centric Reasoning framework that improves reasoning by grounding the analysis in visual evidence. The work provides a practical, public benchmark and baseline as a step toward embodied AI capable of robust tool use in the real world.

Abstract

The ability to use, understand, and create tools is a hallmark of human intelligence, enabling sophisticated interaction with the physical world. For any general-purpose intelligent agent to achieve true versatility, it must also master these fundamental skills. While modern Multimodal Large Language Models (MLLMs) leverage their extensive common knowledge for high-level planning in embodied AI and in downstream Vision-Language-Action (VLA) models, the extent of their true understanding of physical tools remains unquantified. To bridge this gap, we present PhysToolBench, the first benchmark dedicated to evaluating the comprehension of physical tools by MLLMs. Our benchmark is structured as a Visual Question Answering (VQA) dataset comprising over 1,000 image-text pairs. It assesses capabilities across three distinct difficulty levels: (1) Tool Recognition: requiring the recognition of a tool's primary function. (2) Tool Understanding: testing the ability to grasp the underlying principles of a tool's operation. (3) Tool Creation: challenging the model to fashion a new tool from surrounding objects when conventional options are unavailable. Our comprehensive evaluation of 32 MLLMs, spanning proprietary, open-source, and specialized embodied models as well as backbones in VLAs, reveals a significant deficiency in tool understanding. Furthermore, we provide an in-depth analysis and propose preliminary solutions. Code and dataset are publicly available.

Paper Structure

This paper contains 19 sections, 20 figures, 2 tables.

Figures (20)

  • Figure 1: For an embodied agent, using physical tools is crucial in many tasks. How well the agent understands physical tools significantly affects a task's success rate and execution efficiency (Top). PhysToolBench (Bottom) systematically evaluates how well multimodal LLMs understand physical tools. The benchmark is designed with three progressive levels of difficulty and employs a Visual Question Answering (VQA) format. Note that in the actual benchmark, tools in the images are numerically labeled; the images here are for illustrative purposes only.
  • Figure 2: Statistics of PhysToolBench. (a) Distribution of categories. (b) Distribution of difficulty levels. (c) Word cloud of the task descriptions given to MLLMs.
  • Figure 3: MLLM Leaderboard on our PhysToolBench, ranked by overall performance.
  • Figure 4: Overall performance vs. model size for open-source MLLMs. A significant correlation is observed between performance and model size.
  • Figure 5: Performance comparison between the embodied models and their base model.
  • ...and 15 more figures