P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

Yun Luo, Futing Wang, Qianjia Cheng, Fangchen Yu, Haodi Lei, Jianhao Yan, Chenxi Li, Jiacheng Chen, Yufeng Zhao, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Wenxuan Zeng, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui

TL;DR

P1-VL is an open-source family of vision-language models specialized for physics problems, built to ground physics reasoning in visual perception. It combines Curriculum Reinforcement Learning, which expands problem difficulty progressively during post-training, with an agentic augmentation framework (PhysicsMinions) that enables iterative self-verification at inference. The models are trained on curated multimodal physics data and evaluated on HiPhO, a benchmark of 13 Physics Olympiad exams from 2024-2025. P1-VL-235B-A22B achieves state-of-the-art open-source performance on HiPhO with 12 gold medals and, when paired with PhysicsMinions, ranks No.2 among evaluated models, behind only Gemini-3-Pro. It also shows significant gains over its base models on other STEM benchmarks. By open-sourcing P1-VL, the authors aim to provide a step toward general-purpose physical intelligence that aligns visual perception with abstract physical laws for machine scientific discovery.

Abstract

The transition from symbolic manipulation to science-grade reasoning represents a pivotal frontier for Large Language Models (LLMs), with physics serving as the critical test anchor for binding abstract logic to physical reality. Physics demands that a model maintain physical consistency with the laws governing the universe, a task that fundamentally requires multimodal perception to ground abstract logic in reality. At the Olympiad level, diagrams are often constitutive rather than illustrative, containing essential constraints, such as boundary conditions and spatial symmetries, that are absent from the text. To bridge this visual-logical gap, we introduce P1-VL, a family of open-source vision-language models engineered for advanced scientific reasoning. Our method harmonizes Curriculum Reinforcement Learning, which employs progressive difficulty expansion to stabilize post-training, with Agentic Augmentation, enabling iterative self-verification at inference. Evaluated on HiPhO, a rigorous benchmark of 13 exams from 2024-2025, our flagship P1-VL-235B-A22B becomes the first open-source Vision-Language Model (VLM) to secure 12 gold medals and achieves state-of-the-art performance among open-source models. Our agent-augmented system achieves the No.2 overall rank globally, trailing only Gemini-3-Pro. Beyond physics, P1-VL demonstrates remarkable scientific reasoning capacity and generalizability, establishing significant leads over base models in STEM benchmarks. By open-sourcing P1-VL, we provide a foundational step toward general-purpose physical intelligence to better align visual perception with abstract physical laws for machine scientific discovery.

Paper Structure

This paper contains 28 sections, 24 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: P1-VL-235B-A22B stands as the state-of-the-art open-source VLM in the Physics Olympiad benchmark (HiPhO), placing No.3 behind Gemini-3-Pro(high) and GPT-5.2(high) and achieving 12 gold medals. Even at mid-scale, P1-VL-30B-A3B achieved 9 gold medals, with a higher average score than most of the open-source models except P1-235B-A22B and DeepSeek-V3.2-Thinking. With the PhysicsMinions agent framework, P1-VL-235B-A22B+PhysicsMinions ranks No.2 on HiPhO.
  • Figure 1: Statistics of the multi-modal training data.
  • Figure 2: A question sample from the International Physics Olympiad 2025 (IPhO 2025), where the question requires measuring the radius of bubbles and estimating their velocity in Fig 2.
  • Figure 3: Distribution of the training data.
  • Figure 4: Data collection pipeline for physics data.
  • ...and 9 more figures