BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning

Tarjei Paule Hage, Markus J. Buehler

TL;DR

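Reinforcement learning with exact, verifiable physics rewards induces procedural solution templates rather than transferable physical reasoning, a key limitation of outcome-level alignment. The results suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.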

Abstract

Can reinforcement learning with hard, verifiable rewards teach a compact language model to reason about physics, or does it primarily learn to pattern-match toward correct answers? We study this question by training a 1.5B-parameter reasoning model on beam statics, a classic engineering problem, using parameter-efficient RLVR with binary correctness rewards from symbolic solvers, without teacher-generated reasoning traces. The best BeamPERL checkpoint achieves a 66.7% improvement in Pass@1 over the base model. However, the learned competence is anisotropic: the model generalizes compositionally (more loads) but fails under topological shifts (moved supports) that require the same equilibrium equations. Intermediate checkpoints yield the strongest reasoning, whereas continued optimization degrades robustness even as reward is maintained. These findings reveal a key limitation of outcome-level alignment: reinforcement learning with exact physics rewards induces procedural solution templates rather than internalization of governing equations. The precision of the reward signal, even when analytically exact, does not by itself guarantee transferable physical reasoning. Our results suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.

Paper Structure (22 sections, 31 equations, 11 figures, 2 tables)

This paper contains 22 sections, 31 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Schematic of a simply supported beam defined on the one-dimensional domain $x\in [0, L]$, where $x$ denotes the axial coordinate along the beam and $L$ is the beam length. The beam is supported by a pinned support located at $x_{\text{pin}} \in [0, L]$ and a roller support located at $x_{\text{roller}} \in [0, L]$, with $x_{\text{pin}} \neq x_{\text{roller}}$. The beam is subjected to $N$ transverse point loads $P_i$, applied at positions $x_i \in [0, L]$, with $x_i \neq x_j \; \forall \; i \neq j$, for $i = 1, \dots, N$, where in this schematic $N=2$.
  • Figure 2: End-to-end dataset generation pipeline for beam-mechanics problem-solving. Discrete beam configurations are first sampled from a symbolic parameter space and solved analytically using a symbolic mechanics solver to obtain exact ground-truth support reactions. An LLM is used to generate multiple natural-language problem formulations for each beam configuration, while the underlying physical solution remains unchanged. This many-to-one construction yields a verifiable question–answer dataset with multiple linguistic variants mapped to a single mechanically correct solution.
  • Figure 3: PE-RLVR-FT workflow for adapting a distilled LRM to beam-mechanics problem-solving. Beam-mechanics questions from the synthetic dataset are used to prompt a frozen base model augmented with trainable LoRA adapters. For each prompt, the model samples a group of $G$ candidate responses, which are evaluated by a deterministic reward function based on format adherence and symbolic beam statics correctness, each assigning a binary reward. These rewards are converted into relative advantage signals $A_1,\dots,A_G$ and used by GRPO to update only the LoRA parameters, while all pretrained backbone weights remain frozen. A more detailed description of this process is provided in Section \ref{subsec:methods_training}. Adapted from DeepSeek-AI (2025), DeepSeek-R1.
  • Figure 4: Training performance against the cumulative number of training examples. (a) Total weighted reward. (b) Unscaled format and accuracy rewards. All rewards are averaged across all rollouts for all observed examples at each training step. Both sub-rewards increase sharply during the early training phase, after which the accuracy reward slightly decreases while the format reward remains consistently high, indicating that early performance gains are primarily associated with improved output formatting.
  • Figure 5: Evolution of training reward, completion length, and KL divergence plotted against the cumulative number of training examples. In (a), reward and rollout completion length are shown on separate y-axes, with completion length, in tokens, decreasing early in training and stabilizing at a task-appropriate token length. In (b), reward and KL divergence are shown on separate y-axes, with KL divergence increasing conservatively at first and more sharply at later stages. All values are averaged across all rollouts for all observed examples at each training step.
  • ...and 6 more figures
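
The setup in Figures 1 and 3 can be illustrated with a small numerical sketch. It is not the authors' implementation: the paper uses a symbolic mechanics solver, whereas this example works with plain floats. The function names (`support_reactions`, `binary_reward`, `group_advantages`), the downward-positive sign convention for loads, the relative tolerance, and the mean/standard-deviation normalization of the group advantages (the form commonly associated with GRPO) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def support_reactions(x_pin, x_roller, loads):
    """Vertical reactions (R_pin, R_roller) of the beam in Figure 1.

    loads is a list of (P_i, x_i) pairs. Assumed convention: P_i > 0 acts
    downward and a positive reaction acts upward.
    """
    if x_pin == x_roller:
        raise ValueError("pin and roller must sit at distinct positions")
    # Moment balance about the pin:
    #   R_roller * (x_roller - x_pin) = sum_i P_i * (x_i - x_pin)
    r_roller = sum(p * (x - x_pin) for p, x in loads) / (x_roller - x_pin)
    # Vertical force balance: R_pin + R_roller = sum_i P_i
    r_pin = sum(p for p, _ in loads) - r_roller
    return r_pin, r_roller


def binary_reward(predicted, reference, rel_tol=1e-6):
    """Return 1.0 if every predicted reaction matches the reference, else 0.0."""
    if len(predicted) != len(reference):
        return 0.0
    matches = all(
        abs(p - r) <= rel_tol * max(1.0, abs(r))
        for p, r in zip(predicted, reference)
    )
    return 1.0 if matches else 0.0


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages A_1..A_G (assumed mean/std normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    loads = [(100.0, 2.5), (50.0, 7.5)]  # N = 2 point loads on a beam with L = 10
    truth = support_reactions(0.0, 10.0, loads)  # supports at the beam ends
    moved = support_reactions(2.0, 10.0, loads)  # pin moved inward
    print(truth, moved)
    # A group of G = 4 hypothetical rollouts: two correct, two incorrect.
    rollouts = [truth, (75.0, 75.0), truth, (0.0, 150.0)]
    rewards = [binary_reward(r, truth) for r in rollouts]
    print(rewards, group_advantages(rewards))
```

Both support placements are solved by the same two equilibrium equations; only `x_pin` and `x_roller` change. This is the sense in which the abstract describes the topological shift (moved supports) as requiring the same governing equations. When every rollout in a group receives the same reward, the normalized advantages are all zero, so that group contributes no learning signal under this assumed normalization.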