Do Math Reasoning LLMs Help Predict the Impact of Public Transit Events?

Bowen Fang, Ruijian Zha, Xuan Di

TL;DR

This work tackles predicting realized transit-incident durations from unstructured GTFS-rt alerts, addressing the limitations of standard supervised fine-tuning under noisy, continuous targets. It introduces Reinforcement Learning from Verifiable Rewards (RLVR) with a tolerance-based, shaped reward to provide partial credit for predictions within a defined error margin, adapting verifiers to continuous forecasting. The authors curate a NYC MTA alert-duration dataset and demonstrate that general-purpose instruction-tuned LLMs outperform math-focused models on this noisy task, with the shaped RLVR reward driving strong performance at tight accuracy bands (e.g., Acc@5). They further show that an appropriate verifier design and prompt strategy yield a 35% relative improvement in Acc@5 over the strongest baseline, indicating RLVR can be effectively applied to real-world, noisy forecasting when rewards reflect the problem’s continuous nature.

Abstract

Predicting public transit incident duration from unstructured text alerts is a critical but challenging task. Addressing the domain sparsity of transit operations with standard Supervised Fine-Tuning (SFT) is difficult, as the task involves noisy, continuous labels and lacks reliable expert demonstrations for reasoning. While Reinforcement Learning from Verifiable Rewards (RLVR) excels at tasks with binary correctness, like mathematics, its applicability to noisy, continuous forecasting is an open question. This work, to our knowledge, is the first to bridge the gap between RLVR training of LLMs and the critical, real-world forecasting challenges in public transit operations. We adapt RLVR to this task by introducing a tolerance-based, shaped reward function that grants partial credit within a continuous error margin, rather than demanding a single correct answer. We systematically evaluate this framework on a curated dataset of NYC MTA service alerts. Our findings show that general-purpose, instruction-tuned LLMs significantly outperform specialized math-reasoning models, which struggle with the ambiguous, real-world text. We empirically demonstrate that the binary reward is unstable and degrades performance, whereas our shaped reward design is critical and allows our model to dominate on the most challenging metrics. While classical regressors are superior at minimizing overall MAE or MSE, our RLVR approach achieves a 35% relative improvement in 5-minute accuracy (Acc@5) over the strongest baseline. This demonstrates that RLVR can be successfully adapted to real-world, noisy forecasting, but requires a verifier design that reflects the continuous nature of the problem.
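To make the contrast between a binary verifier and a tolerance-based, shaped verifier concrete, the following is a minimal sketch. The abstract does not give the exact functional form, so the linear decay, the function names, and the default tolerance `delta` are all illustrative assumptions rather than the authors' implementation.

```python
def binary_reward(pred_min: float, true_min: float, delta: float = 5.0) -> float:
    """Hard verifier: full credit inside the tolerance band, nothing outside.

    Hypothetical helper; `delta` is the tolerance in minutes.
    """
    return 1.0 if abs(pred_min - true_min) <= delta else 0.0


def shaped_reward(pred_min: float, true_min: float, delta: float = 5.0) -> float:
    """Shaped verifier: partial credit that decays with the absolute error.

    Assumed form (not taken from the paper): reward falls linearly from 1.0
    at zero error to 0.0 once the error reaches `delta` minutes.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    error = abs(pred_min - true_min)
    return max(0.0, 1.0 - error / delta)


if __name__ == "__main__":
    # A prediction 2.5 minutes off the realized duration of 10 minutes.
    print(binary_reward(12.5, 10.0))  # 1.0 (inside the 5-minute band)
    print(shaped_reward(12.5, 10.0))  # 0.5 (half of the margin used up)
```

Under this assumed form, two predictions inside the band receive different rewards depending on how close they are, which is the kind of graded signal the shaped design is meant to provide; the binary variant cannot distinguish them.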

Paper Structure

This paper contains 47 sections, 19 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Transit disruption duration distributions per category. Durations are extracted from NYC MTA service alerts starting from April 28, 2020. Categories are extracted from GTFS-rt alerts using an LLM, as detailed in the exploratory data analysis section. Histograms reveal distinct patterns based on incident type. Passenger incidents and train mechanical issues are heavily right-skewed, with most resolving quickly (modes at 5–15 min). In contrast, external factors and operational events display broader distributions with longer tails, while operational incidents show the highest variance. Blue dashed lines indicate the median duration for each category.
  • Figure 2: Map of related work in LLMs for transit and urban tasks. The figure positions our research relative to existing literature along two primary axes. Horizontal Axis (Training/Adaptation Method): This axis represents the method used to adapt the LLM to domain knowledge, progressing from non-LLM/VLM baselines and inference-time Prompting, to SFT (Supervised Fine-Tuning) on demonstrations, and finally to RLVR (Reinforcement Learning from Verifiable Rewards). Vertical Axis (Task Family): This axis categorizes the application domain, including user-facing Interfaces, Agentic ops (agent-based operations), Urban ST (urban spatio-temporal forecasting), and our specific focus, Incident duration prediction. While prior work has applied prompting and SFT to related urban tasks, our contribution (highlighted in the top-right) is the first to adapt RLVR for text-grounded incident duration forecasting.
  • Figure 3: Framework overview for incident-duration modeling from service alerts. Top (Inference): An incoming alert serves as the Instance Prompt. It is combined with offline-extracted Global Features and a System Prompt to elicit reasoning. These inputs condition a frozen model ($\theta$) to produce a duration estimate. The Evaluation panel shows how this prediction is scored using a tolerance ruler (Acc@$\delta$, Soft@$\delta$) and standard metrics (MAE, MSE). Bottom (Training): The model is trained using RLVR. A ground-truth duration is derived from the full Alert Sequence. The Verifier Reward function compares the model's prediction to this ground truth, assigning a reward based on tolerance bands ($\delta$) and optional shaping (e.g., soft vs. hard, $\alpha$ scaling). This reward signal updates the reasoning model ($\theta$). Figure Legend: Solid outlines denote the fixed inference path; dashed outlines represent ablated design choices (prompting, reward variants, backbones). Colors indicate component roles: green for prompting context, yellow for the verifier, gray for the model, and blue for evaluation. Logos indicate interchangeable model backbones.
  • Figure 4: Top 10 most frequent event types. Among 26 fine-grained event types, train service updates (13.2%, N=2,784) and mechanical problems (12.8%, N=2,702) are most common. Medical assistance is the leading passenger incident type (8.7%, N=1,835). The top 10 types account for 67.3% of all events.
  • Figure 5: Event type distribution within major categories. Internal composition of the three largest categories reveals concentrated patterns: Train Mechanical dominated by mechanical problems (35.8%) and brake activations (23.9%); External Factors primarily delays (60.4%); Passenger Incidents led by medical assistance (44.8%).
  • ...and 8 more figures
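
The evaluation metrics named in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes Acc@δ is the fraction of predictions within δ minutes of the realized duration, consistent with the abstract's "5-minute accuracy (Acc@5)". Soft@δ is omitted because its definition is not given here, and the function names are hypothetical.

```python
from typing import Sequence


def acc_at(preds: Sequence[float], trues: Sequence[float], delta: float) -> float:
    """Assumed Acc@delta: share of predictions with |error| <= delta minutes."""
    if len(preds) != len(trues) or not preds:
        raise ValueError("preds and trues must be non-empty and equal length")
    hits = sum(1 for p, t in zip(preds, trues) if abs(p - t) <= delta)
    return hits / len(preds)


def mae(preds: Sequence[float], trues: Sequence[float]) -> float:
    """Mean absolute error in minutes."""
    return sum(abs(p - t) for p, t in zip(preds, trues)) / len(preds)


def mse(preds: Sequence[float], trues: Sequence[float]) -> float:
    """Mean squared error in squared minutes."""
    return sum((p - t) ** 2 for p, t in zip(preds, trues)) / len(preds)


if __name__ == "__main__":
    # Toy durations in minutes, invented purely for illustration.
    preds = [10.0, 20.0, 30.0]
    trues = [12.0, 30.0, 30.0]
    print(acc_at(preds, trues, delta=5.0))  # two of three within 5 minutes
    print(mae(preds, trues))                # 4.0
    print(mse(preds, trues))
```

The toy example also hints at why the two metric families can disagree: a single large miss dominates MSE while leaving Acc@5 relatively high, so a model can lead on tight accuracy bands without leading on MAE or MSE.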