From Models to Systems: A Comprehensive Fairness Framework for Compositional Recommender Systems

Brian Hsu, Cyrus DiCiccio, Natesh Sivasubramoniapillai, Hongseok Namkoong

TL;DR

This paper argues that fairness in industrial recommender systems requires a system-level perspective that spans retrieval, scoring, and serving, rather than focusing on single-model fairness. It formalizes a compositional utility framework where end-user utility is driven by group-specific preferences and calibrated model outputs, and demonstrates that disparities can persist due to upstream–downstream interactions and heterogeneous user preferences. To mitigate these disparities, it introduces a Bayesian optimization (BayesOpt) approach that jointly optimizes overall utility and the Deviation from Equal Representation (DER) metric via Expected Hyper-Volume Improvement (EHVI), integrating multi-label fairness with downstream business objectives. Empirical results on synthetic and real datasets show that the proposed Fair EHVI method yields better Pareto frontiers for utility and fairness than baselines, underscoring the value of system-level fairness tools for deployment contexts and regulatory regimes. The work highlights practical implications for timescale considerations and governance when pursuing equity across diverse user populations in large-scale recommender pipelines.

Abstract

Fairness research in machine learning often centers on ensuring equitable performance of individual models. However, real-world recommendation systems are built on multiple models and even multiple stages, from candidate retrieval to scoring and serving, which raises challenges for responsible development and deployment. This system-level view, as highlighted by regulations like the EU AI Act, necessitates moving beyond auditing individual models as independent entities. We propose a holistic framework for modeling system-level fairness, focusing on the end-utility delivered to diverse user groups, and consider interactions between components such as retrieval and scoring models. We provide formal insights on the limitations of focusing solely on model-level fairness and highlight the need for alternative tools that account for heterogeneity in user preferences. To mitigate system-level disparities, we adapt closed-box optimization tools (e.g., BayesOpt) to jointly optimize utility and equity. We empirically demonstrate the effectiveness of our proposed framework on synthetic and real datasets, underscoring the need for a system-level framework.
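
To make the idea of jointly optimizing utility and equity more concrete, the following is a minimal sketch of the two-objective hypervolume quantity that hypervolume-based BayesOpt acquisition functions build on. It is not the paper's implementation. The names `pareto_front`, `hypervolume_2d`, and `hypervolume_improvement` are hypothetical, and the second objective is assumed to be a "higher is better" fairness score (for example, a negated disparity metric), since the definition of DER is not given here. EHVI would take the expectation of this improvement under a surrogate model's posterior; the sketch only computes the deterministic improvement for an observed point.

```python
from typing import List, Tuple

# Assumption: each point is (utility, fairness_score), and both are maximized.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by utility in descending order."""
    front: List[Point] = []
    best_fairness = float("-inf")
    # Sweep from highest to lowest utility. A point is kept only if it strictly
    # improves the fairness score over every point with higher-or-equal utility.
    for p in sorted(set(points), key=lambda q: (-q[0], -q[1])):
        if p[1] > best_fairness:
            front.append(p)
            best_fairness = p[1]
    return front


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded below by the reference point `ref`."""
    area = 0.0
    prev_fairness = ref[1]
    for utility, fairness in pareto_front(points):
        if utility <= ref[0] or fairness <= prev_fairness:
            continue  # the point adds no area above the reference point
        # Each front point adds a horizontal strip of new dominated area.
        area += (utility - ref[0]) * (fairness - prev_fairness)
        prev_fairness = fairness
    return area


def hypervolume_improvement(candidate: Point, observed: List[Point], ref: Point) -> float:
    """Increase in dominated area obtained by adding `candidate` to `observed`."""
    return hypervolume_2d(observed + [candidate], ref) - hypervolume_2d(observed, ref)
```

A candidate that is dominated by the current frontier has zero improvement, while a candidate that extends the frontier in either objective has a positive one, which is why this quantity rewards configurations that trade off utility and fairness well rather than maximizing only one of them.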

Paper Structure

This paper contains 33 sections, 3 theorems, 24 equations, 8 figures, 2 tables.

Key Result

Lemma 2.3

Suppose individual models $f_{k}(X,Z^{j})$ are calibrated with respect to their intended label $Y_{k}$ across the entire feature space: $\mathbb{E}\left[Y_{k}^{j} \mid f_{1}(X, Z^{j}), \ldots, f_{K}(X, Z^{j})\right] = f_{k}(X,Z^{j})$ a.s. and that true and serving preferences are positive $\{\alpha^

Figures (8)

  • Figure 1: AI Recommendation System Serving Pipeline. Recommendations for feeds, ads, and social networking are generated by a multi-step process involving multiple ML models. An upstream process first fetches potentially relevant items via a candidate retrieval model, often called a "first pass ranker" (FPR), reducing the item set from millions to hundreds or thousands using scalable methods such as approximate nearest neighbors [liu2004investigation]. Then, each model $f_{k}$ scores the items independently based on the probability of $\{$View, Click, Apply$\}$. This stage of ML models (red) is the overwhelming focus of the fairness literature and audits. Finally, to surface the most relevant items, a "second pass ranker" (SPR) combines these individual models through a weighted sum $\sum_{k}\alpha_{k}f_{k}$. While the SPR is simple, its intuitiveness has led to its ubiquity, with the world's largest platforms, such as Meta [Instagram], LinkedIn [LinkedIn], Microsoft [Microsoft], X/Twitter [Twitter], Snapchat [Snapchat_spr], Pinterest [Pinterest], and Spotify [Lamere_2021], stating or suggesting that they use a variant of this overarching system.
  • Figure 2: (Left) Proxy and metric tracking. (Right) Issues of vanilla BayesOpt.
  • Figure 3: Comparison of methods on scalarized outcome
  • Figure 4: Pareto frontiers for the four tested datasets
  • Figure 5: Utility and DER surface - Synthetic data
  • ...and 3 more figures
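
The serving pipeline described in the Figure 1 caption can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the system used by any platform: the function names `retrieve`, `second_pass_rank`, and `serve`, the dictionary item representation, and the field names (`sim`, `p_click`, `p_apply`) are all hypothetical. Retrieval is modeled as a plain top-n cut on a cheap score instead of approximate nearest neighbors.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical item representation: field name -> numeric value.
Item = Dict[str, float]
Scorer = Callable[[Item], float]


def retrieve(items: Sequence[Item], retrieval_score: Scorer, n: int) -> List[Item]:
    """First pass ranker (FPR): keep the n items with the highest retrieval score."""
    return sorted(items, key=retrieval_score, reverse=True)[:n]


def second_pass_rank(
    candidates: Sequence[Item], models: Sequence[Scorer], alphas: Sequence[float]
) -> List[Item]:
    """Second pass ranker (SPR): order candidates by the weighted sum of model scores."""
    if len(models) != len(alphas):
        raise ValueError("need exactly one weight per model")

    def combined(item: Item) -> float:
        # sum_k alpha_k * f_k(item), as in the caption's weighted sum
        return sum(a * f(item) for a, f in zip(alphas, models))

    return sorted(candidates, key=combined, reverse=True)


def serve(
    items: Sequence[Item],
    retrieval_score: Scorer,
    n: int,
    models: Sequence[Scorer],
    alphas: Sequence[float],
) -> Item:
    """Run retrieval, scoring, and ranking, then return the top-ranked item."""
    candidates = retrieve(items, retrieval_score, n)
    return second_pass_rank(candidates, models, alphas)[0]


if __name__ == "__main__":
    catalog = [
        {"id": 0, "sim": 0.9, "p_click": 0.1, "p_apply": 0.05},
        {"id": 1, "sim": 0.8, "p_click": 0.3, "p_apply": 0.01},
        {"id": 2, "sim": 0.2, "p_click": 0.9, "p_apply": 0.90},
    ]
    scorers = [lambda it: it["p_click"], lambda it: it["p_apply"]]
    print(serve(catalog, lambda it: it["sim"], 2, scorers, [1.0, 2.0])["id"])
```

In the toy catalog, item 2 has the highest weighted score under the SPR but is never served, because the retrieval step drops it before scoring. This is a small instance of the upstream–downstream interaction that motivates evaluating the pipeline as a whole and not only the scoring models.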

Theorems & Definitions (10)

  • Definition 2.1: Best Item for Serving
  • Definition 2.2: User Utility
  • Lemma 2.3
  • Theorem 3.1: Utility Gap Bound From Preference Misspecification
  • Definition 3.2: Candidate Retrieval Model Quality
  • Theorem 3.3: Utility Gap Bound From Candidate Retrieval Performance Degradation
  • Definition 4.1
  • proof
  • proof
  • proof