On the Convergence of Multicalibration Gradient Boosting

Daniel Haimovich; Fridolin Linder; Lorenzo Perini; Niek Tax; Milan Vojnovic

TL;DR

This work provides the first convergence guarantees for multicalibration gradient boosting in regression with squared-error loss. Modeling the boosting dynamics as a discrete-time system, the authors show the prediction-update gap declines at rate $O(1/\sqrt{T})$, ensuring asymptotic multicalibration, with linear convergence under smooth weak learners. They further extend the analysis to relaxed and adaptive rescaling schemes, proving robustness and, in the adaptive case, quadratic convergence near the optimum. Empirical results on five real-world datasets corroborate the theory, showing geometric decay of the prediction gap and practical benefits of rescaling strategies. The findings supply a theoretical foundation for deploying multicalibration boosting in production and inform design choices like stopping rules and regularization.

Abstract

Multicalibration gradient boosting has recently emerged as a scalable method that empirically produces approximately multicalibrated predictors and has been deployed at web scale. Despite this empirical success, its convergence properties are not well understood. In this paper, we bridge the gap by providing convergence guarantees for multicalibration gradient boosting in regression with squared-error loss. We show that the magnitude of successive prediction updates decays at $O(1/\sqrt{T})$, which implies the same convergence rate bound for the multicalibration error over rounds. Under additional smoothness assumptions on the weak learners, this rate improves to linear convergence. We further analyze adaptive variants, showing local quadratic convergence of the training loss, and we study rescaling schemes that preserve convergence. Experiments on real-world datasets support our theory and clarify the regimes in which the method achieves fast convergence and strong multicalibration.
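To make the quantities in the abstract concrete, the following is a minimal toy sketch, not the authors' algorithm. It assumes a deliberately simple weak learner (the mean residual within each cell formed by a group label and a bin of the current prediction), synthetic data, and hypothetical names `fit_cell_means` and `boost`. It tracks the empirical prediction gap $\|f_{t+1}-f_t\|_2$ across rounds with squared-error loss and a constant weight `w`.

```python
import numpy as np

def fit_cell_means(groups, preds, residuals, n_bins=10):
    """Toy weak learner (an assumption, not the paper's oracle):
    predict the mean residual in each (group, prediction-bin) cell."""
    bins = np.clip((preds * n_bins).astype(int), 0, n_bins - 1)
    cell = groups * n_bins + bins
    n_cells = (groups.max() + 1) * n_bins
    sums = np.bincount(cell, weights=residuals, minlength=n_cells)
    counts = np.bincount(cell, minlength=n_cells)
    means = np.divide(sums, counts, out=np.zeros(n_cells), where=counts > 0)
    return means[cell]

def boost(groups, y, f0, n_rounds=20, w=1.0):
    """Run n_rounds of residual fitting; return final predictions and
    the per-round empirical L2 gap sqrt(mean((f_next - f)^2))."""
    f = f0.copy()
    gaps = []
    for _ in range(n_rounds):
        h = fit_cell_means(groups, f, y - f)  # fit residuals of squared loss
        f_next = f + w * h                    # w = 1.0 means no rescaling
        gaps.append(float(np.sqrt(np.mean((f_next - f) ** 2))))
        f = f_next
    return f, gaps

# Synthetic data (illustrative only): the base predictor f0 ignores groups.
rng = np.random.default_rng(0)
n = 5000
groups = rng.integers(0, 4, size=n)
x = rng.uniform(size=n)
y = np.clip(0.2 + 0.15 * groups * x + 0.1 * rng.normal(size=n), 0.0, 1.0)
f0 = np.clip(0.2 + 0.3 * x, 0.0, 1.0)

f, gaps = boost(groups, y, f0)
print("first gap:", gaps[0], "last gap:", gaps[-1])
```

In this toy setting with `w = 1.0`, each round's update is a least-squares projection of the residual onto the cell indicators, so the training MSE drops by exactly the squared gap at every round. The squared gaps therefore sum to at most the initial MSE, which is one elementary way to see why the smallest gap over $T$ rounds can be at most of order $1/\sqrt{T}$. Whether this matches the paper's proof technique is not stated in this summary.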

Paper Structure

This paper contains 50 sections, 13 theorems, 158 equations, 3 figures, 1 table, and 1 algorithm.

Introduction
Related Work
Foundations of Multicalibration.
Multicalibration via Gradient Boosting.
Extensions Beyond the Mean.
Proof Techniques.
Preliminaries and Problem Setup
Multicalibration and Notation.
Factorised Hypothesis Classes.
The Multicalibration Boosting Framework.
Main Results
Assumption: Employing a Boosting Oracle.
Fundamental Convergence Guarantees
When is Convergence Fast?
Relaxed and Adaptive Rescaling Strategies
...and 35 more sections

Key Result

Theorem 4.1

Consider the dynamical system in Eq. (equ:dtds) with constant unit rescaling weights ($w_t = 1$). Then:

Figures (3)

Figure 1: The prediction gap $\|f_{t+1}-f_t\|_2$ for all multicalibration boosting rounds $t\le 20$. The bottom row shows the corresponding plots on a log–lin scale, along with the best fitting line (black) and its $R^2$ score.
Figure 2: Evolution of the training MSE (top row) and training MCE (bottom row) over 20 boosting rounds across five datasets.
Figure 3: Average test Multicalibration Error (MCE) per dataset, with star markers indicating the optimal stopping point for each strategy.
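The log–lin fit described in the caption of Figure 1 can be reproduced in spirit with a short sketch. It assumes an ordinary least-squares line through $\log \|f_{t+1}-f_t\|_2$ versus the round index $t$ (the caption does not say which fitting procedure was used), and it runs on synthetic geometric data rather than the paper's results. The helper name `fit_log_linear` is hypothetical.

```python
import numpy as np

def fit_log_linear(gaps):
    """Least-squares line through log(gap) versus round index.
    Returns (slope, intercept, r2). Requires strictly positive gaps."""
    gaps = np.asarray(gaps, dtype=float)
    t = np.arange(len(gaps))
    log_gap = np.log(gaps)
    slope, intercept = np.polyfit(t, log_gap, deg=1)
    fitted = slope * t + intercept
    ss_res = np.sum((log_gap - fitted) ** 2)
    ss_tot = np.sum((log_gap - log_gap.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2

# Synthetic, exactly geometric gaps (illustrative only, not paper data).
gaps = 0.5 * 0.7 ** np.arange(20)
slope, intercept, r2 = fit_log_linear(gaps)
rate = np.exp(slope)  # estimated per-round contraction factor
print("contraction factor:", rate, "R^2:", r2)
```

A straight line on the log–lin scale with $R^2$ close to 1 indicates geometric (linear-rate) decay, and the exponential of the slope estimates the per-round contraction factor.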

Theorems & Definitions (23)

Theorem 4.1: Convergence with No Rescaling
Proof (sketch)
Theorem 4.2: Linear Convergence under Smoothness
Proof (sketch)
Lemma 4.3: Smoothness of $A(f)$
Proof (sketch)
Lemma 4.4: Smoothness of $B(f)$ for Factorized Learners
Proof (sketch)
Theorem 4.5: Convergence with Relaxed Weights
Proof (sketch)
...and 13 more
