PhyCritic: Multimodal Critic Models for Physical AI

Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li, Heng Huang, Jan Kautz, Zhiding Yu

TL;DR

PhyCritic introduces a physics-aware multimodal critic for physical AI by coupling a two-stage RLVR pipeline with self-referential critic finetuning, grounding judgments in the model's own physical reasoning. It defines a formal task framework and builds a dedicated PhyCritic-Bench to evaluate physical-domain judging, achieving state-of-the-art open-source performance on physical judgments and strong generalization to general multimodal evaluation. The approach demonstrates data efficiency, improved stability, and enhanced physical reasoning when used as both a judge and a policy signal for downstream tasks. This work advances reliable, physics-grounded evaluation in embodied AI and sets the stage for broader, physics-aware multimodal critique systems.

Abstract

With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
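
The self-referential judging step described above can be sketched as two model calls: the critic first answers the question itself, then judges the two candidates with that answer in its prompt. This is a minimal, text-only sketch under stated assumptions, not the authors' implementation. The prompt templates, the `generate` callable, and the `toy_generate` stub are all hypothetical, and image or video inputs are omitted.

```python
from typing import Callable, Tuple

# Hypothetical prompt templates; the paper's actual prompts are not shown here.
PREDICT_TEMPLATE = "Question: {question}\nGive your reasoning and answer."
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Your own prediction: {prediction}\n"
    "Response 1: {r1}\n"
    "Response 2: {r2}\n"
    "Which response is better? Reply with 1 or 2."
)


def self_referential_judge(
    generate: Callable[[str], str], question: str, r1: str, r2: str
) -> Tuple[str, int]:
    """Return the critic's own prediction and its preferred response (1 or 2)."""
    # Step 1: the critic produces its own prediction as an internal reference.
    prediction = generate(PREDICT_TEMPLATE.format(question=question))
    # Step 2: the critic judges the pair with that prediction in context.
    verdict = generate(
        JUDGE_TEMPLATE.format(question=question, prediction=prediction, r1=r1, r2=r2)
    )
    # Take the first '1' or '2' in the verdict as the preference.
    for ch in verdict:
        if ch in "12":
            return prediction, int(ch)
    raise ValueError(f"could not parse a preference from: {verdict!r}")


def toy_generate(prompt: str) -> str:
    """Stand-in for a real critic model (assumption, for illustration only)."""
    if "Which response is better?" in prompt:
        return "1" if "oven is closed" in prompt else "2"
    return "The oven is closed."


if __name__ == "__main__":
    pred, choice = self_referential_judge(
        toy_generate,
        "What should the robot do next to bake the tray?",
        "Open the oven, then place the tray inside.",
        "Close the oven, then place the tray inside.",
    )
    print(pred, choice)
```

In a real setting `generate` would wrap a multimodal model call. Returning the prediction alongside the preference keeps both outputs available for inspection or scoring.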

Paper Structure

This paper contains 15 sections, 6 equations, 4 figures, and 12 tables.

Figures (4)

  • Figure 1: PhyCritic first produces its own physics-aware reasoning and prediction, then explicitly applies it as a reference in judging a pair of model responses. In this example, PhyCritic first infers in its own prediction that "the oven is closed". Based on this insight, the model then correctly identifies Response 1 as following the proper causal sequence while Response 2 proposes an unnecessary action. This self-referential process leads to more stable, physically correct judgments.
  • Figure 2: PhyCritic training pipeline. We begin with GRPO training on physics-related QA pairs to enhance the VLM’s physical reasoning ability (left), followed by self-referential critic finetuning to further develop its critique capacity (right).
  • Figure 3: Distribution of prompt sources (left) and model responses (right) in PhyCritic-Bench.
  • Figure 4: Comparison of Best-of-$N$ ensemble mechanisms on CosmosReason1-Bench. Using PhyCritic-7B as the judge consistently improves the base Qwen2.5-VL-7B-Instruct model.
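
To illustrate how a pairwise judge can drive the Best-of-$N$ selection mentioned in Figure 4, here is a minimal sketch. It assumes a sequential knockout scheme in which the current best candidate is compared against each remaining one. The exact ensemble mechanisms compared in the paper are not described in this summary, so `best_of_n`, the `prefer` callable, and `toy_prefer` are hypothetical.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], prefer: Callable[[str, str], int]) -> str:
    """Pick one candidate using N-1 pairwise comparisons.

    `prefer(a, b)` is assumed to return 1 if the judge prefers `a`
    and 2 if it prefers `b`.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    for challenger in candidates[1:]:
        # The challenger replaces the incumbent only when the judge prefers it.
        if prefer(best, challenger) == 2:
            best = challenger
    return best


def toy_prefer(a: str, b: str) -> int:
    """Stand-in judge that prefers the shorter response (illustration only)."""
    return 1 if len(a) <= len(b) else 2


if __name__ == "__main__":
    samples = ["a long rambling plan", "short plan", "medium-length plan"]
    print(best_of_n(samples, toy_prefer))
```

A knockout pass costs N-1 judge calls. A full round-robin would cost N(N-1)/2 calls but depends less on the order of the candidates.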