ForTune: Running Offline Scenarios to Estimate Impact on Business Metrics

Georges Dupret, Konstantin Sozinov, Carmen Barcena Gonzalez, Ziggy Zacks, Amber Yuan, Benjamin Carterette, Manuel Mai, Shubham Bansal, Gwo Liang Leo Lien, Andrey Gatash, Roberto Sanchis Ojeda, Mounia Lalmas

TL;DR

ForTune introduces a lightweight, model-free offline approach to anticipate the impact of potential product changes on long-term business metrics by performing scenario-based re-weighting of historical data. It formulates an entropy-maximizing convex optimization to assign weights that satisfy simple, globally stated constraints reflecting the proposed scenario, and then estimates metrics as weighted averages $\hat{t}=\sum_{n} \omega_n t_n$, with uncertainty assessed via bootstrapping. The method is validated on the CRITEO-UPLIFT dataset and proprietary Spotify data, showing directional alignment with treatment outcomes and useful estimates despite simple constraint definitions; comparisons to nearest-neighbor matching provide bounds on performance. The work clarifies how scenario design shapes predictions, highlights limitations such as potential infeasibility or high variance under strict constraints, and demonstrates practical usefulness for prioritizing experiments and exploring long-term trade-offs in product decision-making.
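The re-weighting idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single scenario constraint (the weighted mean of one feature must hit a target), for which the entropy-maximizing weights take the exponential form $\omega_n \propto \exp(\lambda x_n)$ and $\lambda$ can be found by bisection. The function names and the toy data (`consumption`, `retained`) are hypothetical.

```python
import math

def tilted_weights(x, lam):
    """Normalized weights proportional to exp(lam * x_n)."""
    # Subtract the largest exponent for numerical stability.
    m = max(lam * xi for xi in x)
    e = [math.exp(lam * xi - m) for xi in x]
    z = sum(e)
    return [ei / z for ei in e]

def weighted_mean(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

def max_entropy_weights(x, target, iters=200):
    """Maximum-entropy weights with sum(w) = 1 and sum(w * x) = target.

    Assumes one constraint only; the weighted mean is increasing in lam,
    so bisection on lam is enough.
    """
    if not (min(x) < target < max(x)):
        # The scenario asks for a mean that no re-weighting can reach.
        raise ValueError("scenario is infeasible for this sample")
    lo, hi = -1.0, 1.0
    # Widen the bracket until it contains the solution.
    while weighted_mean(tilted_weights(x, lo), x) > target:
        lo *= 2.0
    while weighted_mean(tilted_weights(x, hi), x) < target:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weighted_mean(tilted_weights(x, mid), x) < target:
            lo = mid
        else:
            hi = mid
    return tilted_weights(x, 0.5 * (lo + hi))

# Hypothetical toy data: per-user consumption and a binary outcome.
consumption = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
retained = [0, 0, 1, 0, 1, 1]

# Scenario: average consumption grows by 10%.
target = 1.10 * sum(consumption) / len(consumption)
w = max_entropy_weights(consumption, target)

# Scenario estimate of the outcome as a weighted average.
t_hat = weighted_mean(w, retained)
```

With several constraints the same exponential form holds with one multiplier per constraint, but a general convex solver would be needed instead of bisection. The `ValueError` branch corresponds to the infeasibility limitation mentioned above.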

Abstract

Making ideal decisions as a product leader in a web-facing company is extremely difficult. In addition to navigating the ambiguity of customer satisfaction and achieving business goals, one must also pave a path forward for one's products and services to remain relevant, desirable, and profitable. Data and experimentation to test product hypotheses are key to informing product decisions. Online controlled experiments by A/B testing may provide the best data to support such decisions with high confidence, but can be time-consuming and expensive, especially when one wants to understand the impact on key business metrics such as retention or long-term value. Offline experimentation allows one to rapidly iterate and test, but often cannot provide the same level of confidence, and cannot easily shed light on the impact on business metrics. We introduce a novel, lightweight, and flexible approach to investigating hypotheses, called scenario analysis, that aims to support product leaders' decisions using data about users and estimates of business metrics. Its strengths are that it can provide guidance on trade-offs that are incurred by growing or shifting consumption, estimate trends in long-term outcomes like retention and other important business metrics, and generate hypotheses about relationships between metrics at scale.

Paper Structure (17 sections, 5 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Box plots of the resampling weights for the CRITEO-UPLIFT dataset (https://www.criteo.com/). The weights have been multiplied by the number of observations, so a weight of 1 means that the observation has the same importance in the control and treatment branches. A weight of 5 means that the corresponding observation is five times more influential in the test branch than before resampling. We set the constraints on features $\mathtt{f}_{1}$, $\mathtt{f}_{4}$, $\mathtt{f}_{7}$ and $\mathtt{f}_{10}$ to be the same multiples of the corresponding averages. The multiples are reported on the y-axis. The further the constraints are from the original means (for which the multiple is 1.0), the larger the spread of the weights.
  • Figure 2: Probabilities of Visit. The "control" and "treatment" panes report the probability of visit in the control and treatment sets. The panes titled "ForTune" and "match" show the estimated probability of visit $\hat{v}_{\text{test}}$ on the treatment set by the respective methods. We observe that even though the "match" predictions align better with the histogram in the "treatment" pane, the "ForTune" predictions are quite good.
  • Figure 3: The business metric is scaled to range between -1 and 1. Consumption distribution is evaluated by bootstrapping (B=50) for each value of the consumption percent lift on the x-axis. The distribution is represented both by a violin plot and a regular box plot. The business metric value is distributed around 0 when the consumption lift is zero. The variability results from bootstrapping and gives an estimate of the intrinsic noise in the data.
  • Figure 4: Scaled User Satisfaction. Estimation of user satisfaction in relation to music consumption and discovery of new content based on 50 bootstraps.
  • Figure 5: Distribution of the estimations given by two different scenarios and comparison with the true value. Adding more constraints shifted the distribution of predictions for each bootstrap and made the median of the distribution closer to the actual value.
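
The bootstrap distributions shown in Figures 3 to 5 can be produced along the following lines. This is a generic sketch rather than the paper's code: `bootstrap_estimates`, `summarize`, and the toy `rows` are hypothetical names, and the estimator here is a plain mean used only as a stand-in. In a scenario analysis, the estimator would be the re-weighted average computed on each resample.

```python
import random
import statistics

def bootstrap_estimates(rows, estimator, n_boot=50, seed=0):
    """Return n_boot estimates, each computed on a resample of rows.

    Each resample draws len(rows) observations with replacement.
    """
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(sample))
    return estimates

def summarize(estimates):
    """Quartiles of the bootstrap distribution, as drawn in a box plot."""
    q1, _, q3 = statistics.quantiles(estimates, n=4)
    return {"q1": q1, "median": statistics.median(estimates), "q3": q3}

# Hypothetical toy data: (consumption, binary outcome) per user.
rows = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 1), (6.0, 1)]

def mean_outcome(sample):
    # Stand-in estimator: unweighted mean of the outcome column.
    return sum(t for _, t in sample) / len(sample)

estimates = bootstrap_estimates(rows, mean_outcome, n_boot=50, seed=1)
summary = summarize(estimates)
```

The spread between `q1` and `q3` plays the role of the box in the captions above: it reflects sampling noise in the data rather than any effect of the scenario.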