Applications of Improvements to the Pythagorean Won-Loss Expectation in Optimizing Rosters
Alexander F. Almeida, Kevin Dayaratna, Steven J. Miller, Andrew K. Yang
TL;DR
The paper extends the classic Pythagorean Won-Loss framework by allowing runs scored ($RS$) and runs allowed ($RA$) to arise from independent Weibull distributions with distinct shapes $\gamma_{RS}$ and $\gamma_{RA}$ while fixing the shift $\beta=-\tfrac{1}{2}$. Parameters $(\alpha_{RS},\gamma_{RS},\alpha_{RA},\gamma_{RA})$ are estimated via the Method of Moments from the first two moments of observed per-game runs, after which the win probability $P(X>Y)$ is computed numerically as a two-dimensional integral. This Differently-Shaped Weibull (DSW) model yields improved predictive accuracy over the traditional Pythagorean predictor with $\gamma\approx1.83$ across 30 MLB seasons, at the cost of losing a closed-form win probability. The approach also provides a framework for evaluating player value and suggests extensions to other sports, including potential incorporation of higher moments and sector-specific exponents to capture era- and league-specific run profiles.
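The estimation pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made here for illustration. The function names `fit_weibull_moments` and `win_probability` are hypothetical. The four moment equations are assumed to be the mean and variance equations of each shifted Weibull, $\mu = \alpha\,\Gamma(1+1/\gamma) + \beta$ and $\sigma^2 = \alpha^2\left[\Gamma(1+2/\gamma) - \Gamma(1+1/\gamma)^2\right]$, which under independence decouple into one pair for $RS$ and one pair for $RA$. The inner integral of the two-dimensional win-probability integral is the Weibull CDF, so the sketch evaluates the equivalent one-dimensional integral with a simple midpoint rule; a library quadrature routine would be the more usual choice. The sample means and variances at the end are invented, not data from the paper.

```python
import math

BETA = -0.5  # shift parameter, fixed at -1/2 as in the model


def fit_weibull_moments(mean, var, beta=BETA, lo=0.2, hi=50.0):
    """Method-of-moments fit of a shifted Weibull; returns (alpha, gamma).

    Assumes the two equations are the mean and variance of the distribution.
    """
    m = mean - beta
    target = var / m**2  # squared coefficient of variation of (X - beta)

    def cv2(g):
        # Gamma(1 + 2/g) / Gamma(1 + 1/g)^2 - 1, computed through log-gamma
        return math.exp(math.lgamma(1 + 2 / g) - 2 * math.lgamma(1 + 1 / g)) - 1

    if not (cv2(hi) < target < cv2(lo)):
        raise ValueError("moments fall outside the bracketed shape range")
    # cv2 decreases as the shape grows, so plain bisection finds the root.
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if cv2(mid) > target:
            lo = mid
        else:
            hi = mid
    gamma = 0.5 * (lo + hi)
    alpha = m / math.gamma(1 + 1 / gamma)
    return alpha, gamma


def win_probability(a_rs, g_rs, a_ra, g_ra, n=100_000):
    """P(X > Y) for independent Weibulls sharing the same shift.

    Substituting s = exp(-((x - beta) / a_rs)**g_rs) turns
    P(X > Y) = integral of f_X(x) * F_Y(x) dx into an integral over (0, 1).
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * h                      # midpoint, never exactly 0 or 1
        t = -math.log(s)                       # t = ((x - beta) / a_rs)**g_rs
        ratio = (a_rs / a_ra) * t ** (1.0 / g_rs)
        total += -math.expm1(-(ratio**g_ra))   # F_Y evaluated at x
    return total * h


# Illustrative (made-up) per-game means and variances for one team.
a_rs, g_rs = fit_weibull_moments(mean=4.8, var=9.5)
a_ra, g_ra = fit_weibull_moments(mean=4.4, var=9.0)
print(win_probability(a_rs, g_rs, a_ra, g_ra))
```

When the two shapes coincide, the integral reduces to $\alpha_{RS}^{\gamma}/(\alpha_{RS}^{\gamma}+\alpha_{RA}^{\gamma})$, which gives a convenient sanity check for any numerical implementation.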
Abstract
Bill James' Pythagorean formula has for decades done an excellent job estimating a baseball team's winning percentage from very little data: if the average runs scored and allowed are denoted respectively by ${\rm RS}$ and ${\rm RA}$, there is some $\gamma \approx 2$ such that the winning percentage is approximately ${\rm RS}^\gamma / ({\rm RS}^\gamma + {\rm RA}^\gamma)$. One use case is to determine the value of potential signings to the team, as the formula lets us estimate how many more wins a team obtains over a season given an estimated change in runs scored and runs allowed. We summarize earlier work on the subject and extend the theoretical model of Miller, who assumed the home and away teams' runs arise from independent Weibull distributions with the same shape parameter $\gamma$; this assumption has been observed to describe run data well and yields a win probability equivalent to that of James' formula. We instead model runs scored and allowed as being drawn from independent Weibull distributions with different shape parameters, and use the first and second moments to obtain a system of four equations in the four unknowns. Doing so fits the training data better, yielding more accurate winning-percentage estimates over the last 30 MLB seasons (1994 to 2023). This comes at a small cost: we no longer have a closed-form expression for the win probability, but must evaluate a two-dimensional integral of Weibull distributions and numerically estimate the solutions to the system of equations. Both tasks are straightforward with simple computational programs.
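As a small illustration of the signing use case, the following Python sketch applies the classic Pythagorean formula to a hypothetical change in run production. The function names, the default exponent of 1.83 (the value quoted for the traditional predictor), the 162-game season length, and the sample run averages are all illustrative assumptions rather than values prescribed by the abstract.

```python
def pythagorean_win_pct(rs, ra, gamma=1.83):
    """Classic Pythagorean estimate: RS^gamma / (RS^gamma + RA^gamma)."""
    return rs**gamma / (rs**gamma + ra**gamma)


def added_wins(rs, ra, d_rs=0.0, d_ra=0.0, games=162, gamma=1.83):
    """Estimated change in season wins when average runs scored and allowed
    shift by d_rs and d_ra (hypothetical helper for illustration)."""
    before = pythagorean_win_pct(rs, ra, gamma)
    after = pythagorean_win_pct(rs + d_rs, ra + d_ra, gamma)
    return games * (after - before)


# Made-up example: a signing expected to add 0.2 runs scored per game
# to a team averaging 4.5 runs scored and 4.5 runs allowed.
print(added_wins(4.5, 4.5, d_rs=0.2))
```

The same comparison could be run with a win probability from the differently shaped Weibull model in place of `pythagorean_win_pct`, at the cost of a numerical integral per evaluation.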
