
Distributed Convoluted Rank Regression for Non-Shareable Data under Non-Additive Losses

Wen Zhang, Liping Zhu, Songshan Yang

TL;DR

The paper introduces Distributed Convoluted Rank Regression (DCRR) to address high-dimensional regression where data are split across machines and the loss is a non-additive $U$-statistic. By constructing a surrogate loss that combines the CRR loss on a master with a gradient correction from local machines, the method achieves centralized CRR-like efficiency in a distributed setting, despite non-additivity. A two-stage sparse estimation procedure is developed, featuring an $\ell_1$-penalized stage and folded-concave refinement, with non-asymptotic error bounds, a distributed strong oracle property, and a DHBIC-based model selector that is consistent in distributed environments. The framework allows the number of machines to grow with the data, requires only $O(\log N)$ communication rounds, and demonstrates superior performance to naive divide-and-conquer strategies in simulations and a large-scale real-data example, especially under heavy-tailed noise.

Abstract

We study high-dimensional rank regression when data are distributed across multiple machines and the loss is a non-additive U-statistic, as in convoluted rank regression (CRR). Classical communication-efficient surrogate likelihood (CSL) methods crucially rely on the additivity of the empirical loss and therefore break down for CRR, whose global loss couples all sample pairs across machines. We propose a distributed convoluted rank regression (DCRR) framework that constructs a similar surrogate loss, and we demonstrate its validity under non-additive losses. We show that this surrogate shares the same population minimizer as the full-data CRR loss and yields estimators that are statistically equivalent to centralized CRR. Building on this, we develop a two-stage sparse DCRR procedure -- an iterative $\ell_1$-penalized stage followed by a folded-concave refinement -- and establish non-asymptotic error bounds, a distributed strong oracle property, and a DHBIC-type criterion for consistent model selection. A scaling result shows that the number of machines may diverge as $M = o(N/(s^2 \log p))$ while achieving centralized oracle rates with only $O(\log N)$ communication rounds. Simulations and a large-scale real data example demonstrate substantial gains over naive divide-and-conquer, particularly under heavy-tailed errors.
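To make the non-additivity concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a pairwise loss of the CRR type, in which the absolute value of each residual difference is smoothed by a Gaussian kernel with a hypothetical bandwidth `h`. The names `smoothed_abs` and `pairwise_loss`, the simulated data, and the equal-sized split into `M` machines are all illustrative assumptions. The sketch compares the full-data pairwise loss with the average of the per-machine pairwise losses and counts the share of sample pairs that straddle two machines, which no single machine can evaluate on its own.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # elementwise erf without requiring SciPy


def smoothed_abs(u, h=0.5):
    """|u| convolved with a Gaussian kernel of bandwidth h, i.e. E|u + h*Z|, Z ~ N(0, 1)."""
    z = u / h
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return u * (2.0 * cdf - 1.0) + 2.0 * h * pdf


def pairwise_loss(X, y, beta, h=0.5):
    """Average smoothed loss over all ordered pairs i != j of residual differences."""
    r = y - X @ beta
    diff = r[:, None] - r[None, :]          # n x n matrix of r_i - r_j
    vals = smoothed_abs(diff, h)
    n = len(r)
    return float((vals.sum() - np.trace(vals)) / (n * (n - 1)))


# Simulated data (assumption): N samples, p covariates, heavy-tailed t(2) noise.
rng = np.random.default_rng(0)
N, p, M = 120, 3, 4
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_t(df=2, size=N)
beta = np.zeros(p)

# Full-data loss uses every pair; each machine only sees pairs inside its own shard.
global_loss = pairwise_loss(X, y, beta)
shards = zip(np.array_split(X, M), np.array_split(y, M))
avg_local_loss = float(np.mean([pairwise_loss(Xm, ym, beta) for Xm, ym in shards]))

# Share of ordered pairs whose two samples live on different machines.
n_local = N // M
cross_fraction = 1.0 - (M * n_local * (n_local - 1)) / (N * (N - 1))

print("full-data pairwise loss:   ", global_loss)
print("average of local losses:   ", avg_local_loss)
print("fraction of cross-machine pairs:", cross_fraction)
```

With four equal shards, roughly three quarters of all pairs are cross-machine pairs, so the full-data loss cannot be written as an average of local losses. This is the obstacle that a surrogate built from a master-machine loss plus a gradient correction is designed to work around.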

Paper Structure (17 sections, 28 equations, 4 tables, 1 algorithm)