Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving

Shunfeng Zheng; Yudi Zhang; Meng Fang; Zihan Zhang; Zhitan Wu; Mykola Pechenizkiy; Ling Chen

TL;DR

This work introduces PhoPile, the first multimodal retrieval-augmented generation benchmark for Olympiad-level physics problem solving, combining a 390-question evaluation set with a 2,662-question retrieval corpus that includes diagrams, graphs, and equations. The authors design an LLM-as-judge evaluation framework and benchmark 8 foundation models with 4 text-only and 3 multimodal retrievers, plus several fine-tuned open-source models. Results show that retrieval can improve physics reasoning (e.g., notable gains for Gemini-Pro and LLaMA-3 with BM25/Contriever) but also that noise and non-domain-specific retrieval often harm performance, underscoring the need for domain-aware retrievers and robust multimodal integration. The study highlights both the potential of RAG to enhance physics reasoning and the remaining challenges, motivating future work on cross-modal retrieval, principled prompting, and improved evaluation.

Abstract

Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks, but its capacity for expert-level reasoning, such as solving Olympiad-level physics problems, remains largely unexplored. Inspired by the way students prepare for competitions by reviewing past problems, we investigate the potential of RAG to enhance physics reasoning in foundation models. We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics, enabling systematic study of retrieval-based reasoning. PhoPile includes diagrams, graphs, and equations, capturing the inherently multimodal nature of physics problem solving. Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers. Our results demonstrate that integrating retrieval with physics corpora can improve model performance, while also highlighting challenges that motivate further research in retrieval-augmented physics reasoning.
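To make the retrieval step concrete, the following is a minimal sketch of text-only retrieval with BM25 (one of the retrievers named in the TL;DR) followed by prompt assembly. It is an illustration under stated assumptions, not the authors' pipeline: the toy corpus, the question, the function names (`tokenize`, `bm25_scores`, `retrieve`, `build_prompt`), the prompt wording, and the parameter values `k1=1.5` and `b=0.75` are all hypothetical and do not come from PhoPile.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a retrieval corpus of past problems (not PhoPile data).
TOY_CORPUS = [
    "A simple pendulum of length L swings with small oscillations. Find its period.",
    "A block slides down a frictionless incline of angle theta. Find its acceleration.",
    "A capacitor of capacitance C is charged through a resistor R. Find the time constant.",
]
QUESTION = "Find the period of a pendulum with small oscillations"


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with Okapi BM25."""
    docs = [tokenize(d) for d in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, corpus, top_k=2):
    """Return the `top_k` highest-scoring documents, best first."""
    scores = bm25_scores(query, corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [corpus[i] for i in ranked[:top_k]]


def build_prompt(question, retrieved):
    """Prepend retrieved problems to the question (prompt wording is an assumption)."""
    context = "\n".join(f"Reference problem {i + 1}: {r}" for i, r in enumerate(retrieved))
    return f"{context}\n\nNow solve this problem:\n{question}"


if __name__ == "__main__":
    print(build_prompt(QUESTION, retrieve(QUESTION, TOY_CORPUS, top_k=1)))
```

In this sketch the assembled prompt would then be passed to a foundation model. A purely lexical ranking like this one can surface problems that share wording but not physics, which is one way retrieved context may become the kind of noise the TL;DR warns about.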
