NEWTON: Are Large Language Models Capable of Physical Reasoning?

Yi Ru Wang, Jiafei Duan, Dieter Fox, Siddhartha Srinivasa

TL;DR

This work introduces NEWTON, an integrated framework comprising a large object-attribute repository, a template-driven pipeline, and a 160K-question benchmark for evaluating large language models' physical reasoning about everyday objects. Grounded in 700+ objects and 8 physics attributes, it supports infinite, domain-adaptive prompt generation across three tracks (Foundational, Explicit, and Implicit) covering comprehension, application, and analysis. Empirical results show that GPT-4 excels in scenario-based reasoning but is less consistent than humans in object-attribute reasoning (50% vs. 84%), while ablations indicate potential gains from fine-tuning, larger models, and prompt engineering. By providing a scalable, extensible toolkit for physically grounded reasoning, NEWTON supports the training, evaluation, and deployment of LLMs in robotics and other real-world settings.

Abstract

Large Language Models (LLMs), through their contextualized representations, have been empirically proven to encapsulate syntactic, semantic, word sense, and common-sense knowledge. However, there has been limited exploration of their physical reasoning abilities, specifically concerning the crucial attributes for comprehending everyday objects. To address this gap, we introduce NEWTON, a repository and benchmark for evaluating the physics reasoning skills of LLMs. Further, to enable domain-specific adaptation of this benchmark, we present a pipeline that lets researchers generate a variant of the benchmark customized to the objects and attributes relevant to their application. The NEWTON repository comprises a collection of 2800 object-attribute pairs, providing the foundation for generating infinite-scale assessment templates. The NEWTON benchmark consists of 160K QA questions, curated using the NEWTON repository to investigate the physical reasoning capabilities of several mainstream language models across foundational, explicit, and implicit reasoning tasks. Through extensive empirical analysis, our results highlight the capabilities of LLMs for physical reasoning. We find that LLMs like GPT-4 demonstrate strong reasoning capabilities in scenario-based tasks but exhibit less consistency in object-attribute reasoning compared to humans (50% vs. 84%). Furthermore, the NEWTON platform demonstrates its potential for evaluating and enhancing language models, paving the way for their integration into physically grounded settings, such as robotic manipulation. Project site: https://newtonreasoning.github.io

Paper Structure

This paper contains 29 sections, 2 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: As works begin leveraging LLMs in physically grounded contexts, it is crucial to understand whether such models possess the ability to reason about everyday scenarios. NEWTON, a repository, pipeline, and benchmark, facilitates evaluation of various LLMs in a physical reasoning context.
  • Figure 2: NEWTON Pipeline. In addition to facilitating the curation of the NEWTON repository and benchmark, the NEWTON pipeline makes it convenient to extend the evaluation measures to suit any scenario. The pipeline consists of four main components: annotation, filtering, template generation, and LLM evaluation. The annotation component starts by retrieving object categories from 3D object datasets; these categories are filtered for irrelevancy, redundancy, and ambiguity, and matched against WordNet synsets to remove overlapping categories. Combining the remaining categories with the physical attributes yields the object-attribute templates used in the crowdsourcing process. Each object-attribute sample has a minimum of four overlapping annotations, and the agreement among them is used to filter the annotations and form the NEWTON repository of object-attribute pairs. The template generation step begins with a generic template, which is filled through condition specification and object sampling. The generated questions form a benchmark of 160K questions. The pipeline also enables the formulation of infinite personalized evaluation prompts to suit any intended scenario.
  • Figure 3: Ablations. From left to right: A) BERT fine-tuning results using NEWTON. Note the increase in accuracy on the unseen implicit questions after fine-tuning on NEWTON with samples of 5000, 10000, and 40000, respectively. B) Accuracy by question polarity, where positive polarity represents questions phrased with is, is more, and is the most, while negative polarity represents questions phrased with is not, is less, and is the least. C) Accuracy by position, where the position value indicates the placement of the correct answer within the sequence of possible options in the question template.
  • Figure 4: NEWTON Benchmark Track 1 data statistics. We highlight the data distribution of questions and attributes, total number of tokens per attribute, and number of attributes remaining after filtering for each object category.
  • Figure 5: NEWTON Benchmark Track 2 data statistics. We examine the data distribution for the explicit application track, and illustrate the percentage distribution of different attribute occurrences within the track, counts of question polarity and type, and total tokens for the questions in each attribute.
  • ...and 7 more figures
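
The agreement-based filtering and template-filling steps described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the sample annotations, the binary high/low labels, the 0.75 agreement threshold, the template wording, and the function names `filter_by_agreement` and `generate_question` are all assumptions made for the example. Only the minimum of four overlapping annotations comes from the caption.

```python
import random
from collections import Counter

# Hypothetical crowdsourced annotations: (object, attribute) -> labels.
# Objects, attributes, and the binary label scheme are illustrative only.
ANNOTATIONS = {
    ("glass cup", "brittleness"): ["high", "high", "high", "high"],
    ("pillow", "brittleness"): ["low", "low", "low", "high"],
    ("rope", "brittleness"): ["low", "high", "low", "high"],
    ("brick", "softness"): ["low", "low", "low", "low"],
    ("pillow", "softness"): ["high", "high", "high", "high"],
}


def filter_by_agreement(annotations, min_annotations=4, min_agreement=0.75):
    """Keep pairs whose majority label reaches the agreement threshold.

    The threshold value is an assumption; the caption only states that
    agreement among at least four overlapping annotations is used.
    """
    repository = {}
    for pair, labels in annotations.items():
        if len(labels) < min_annotations:
            continue  # too few overlapping annotations
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            repository[pair] = label
    return repository


# Hypothetical generic template; the benchmark's actual wording may differ.
TEMPLATE = "Which of the following objects has {degree} {attribute}? Options: {options}"


def generate_question(repository, attribute, polarity, rng):
    """Fill the template via condition specification and object sampling."""
    target = "high" if polarity == "positive" else "low"
    candidates = {obj: label for (obj, attr), label in repository.items()
                  if attr == attribute}
    answers = [obj for obj, label in candidates.items() if label == target]
    distractors = [obj for obj, label in candidates.items() if label != target]
    if not answers or not distractors:
        return None  # the condition cannot be satisfied with this repository
    answer = rng.choice(answers)
    options = [answer, rng.choice(distractors)]
    rng.shuffle(options)  # vary the position of the correct answer
    question = TEMPLATE.format(
        degree="higher" if polarity == "positive" else "lower",
        attribute=attribute,
        options=", ".join(options),
    )
    return {"question": question, "options": options, "answer": answer}


if __name__ == "__main__":
    repo = filter_by_agreement(ANNOTATIONS)
    print(generate_question(repo, "brittleness", "positive", random.Random(0)))
```

Shuffling the options and exposing a polarity condition mirrors the position and polarity ablations mentioned in the Figure 3 caption, so the same generator could in principle produce the question variants needed for those breakdowns.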