Enhancing LLMs for Physics Problem-Solving using Reinforcement Learning with Human-AI Feedback

Avinash Anand, Kritarth Prasad, Chhavi Kirtani, Ashwin R Nair, Mohit Gupta, Saloni Garg, Anurag Gautam, Snehal Buldeo, Rajiv Ratn Shah

TL;DR

This work introduces Reinforcement Learning with Human and AI Feedback (RLHAIF) to enhance physics problem-solving by aligning LLM outputs with human preferences and AI-driven assessments. The authors implement a three-phase pipeline—SFT on PhyQA, reward-predictor training from a mixed human/AI preference dataset, and RL fine-tuning with PPO, DPO, or ReMax—across five 7B open-source LLMs, with reward modeling using LLaMA-2-13B. On the PhyQA benchmark, the Mistral-PPO configuration delivers strong reasoning and accuracy, achieving a METEOR of 58.67 and a Reasoning score of 0.74, outperforming prior physics-focused approaches. The study demonstrates that combining human and AI feedback helps improve structured physics reasoning, offering a scalable path toward more robust physics problem-solving in LLMs, while also noting computational and data-quality limitations that shape future work.

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities in text-based tasks but struggle with the complex reasoning required for physics problems, particularly in advanced arithmetic and conceptual understanding. While some research has explored ways to enhance LLMs in physics education using techniques such as prompt engineering and Retrieval-Augmented Generation (RAG), too little effort has gone into addressing their limitations in physics reasoning. This paper presents a novel approach to improving LLM performance on physics questions using Reinforcement Learning with Human and Artificial Intelligence Feedback (RLHAIF). We evaluate several reinforcement learning methods, including Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and ReMax. These methods are chosen to investigate RL policy performance under different settings on the PhyQA dataset, which includes challenging physics problems from high school textbooks. Our RLHAIF approach, tested on leading LLMs such as LLaMA2 and Mistral, achieved superior results, most notably with the Mistral-PPO model, which showed marked improvements in reasoning and accuracy. It achieved a METEOR score of 58.67 and a Reasoning score of 0.74, making it a strong reference point for future research on physics reasoning.
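
The reward-predictor training and the DPO variant named above both learn from pairwise preferences (a preferred answer versus a rejected one). As a minimal sketch, assuming the standard Bradley-Terry pairwise loss for reward modeling and the standard DPO objective, the per-pair computations could look like the Python below. This excerpt does not state the paper's exact losses or hyperparameters, so the function names and the default `beta=0.1` are illustrative assumptions, not the authors' implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for a reward model (assumed form).

    r_chosen / r_rejected are scalar rewards the model assigns to the
    preferred and the rejected answer for the same question.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_pair_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for one preference pair (assumed form).

    Each argument is the summed log-probability of a full answer under
    the trainable policy or the frozen reference (SFT) model.
    """
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    return -log_sigmoid(beta * margin)
```

In both functions a zero margin gives a loss of log 2 (about 0.693), and the loss shrinks as the preferred answer is scored increasingly above the rejected one. PPO and ReMax, by contrast, optimize the policy against the trained reward model's scores rather than against preference pairs directly.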

Paper Structure

This paper contains 13 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: A sample question with Mistral-generated responses under each RL policy: PPO, DPO, and ReMax
  • Figure 2: Our novel procedure for ranking answers to build the Human-AI feedback used in Reward Model training
  • Figure 3: Reasoning score distribution over 100 randomly sampled responses from the Mistral-PPO model