Mini-batch Estimation for Deep Cox Models: Statistical Foundations and Practical Guidance
Lang Zeng, Weijing Tang, Zhao Ren, Ying Ding
TL;DR
This work develops a theoretical and practical framework for mini-batch estimation in deep Cox models by introducing the mini-batch maximum partial-likelihood estimator (mb-MPLE). It proves that mb-MPLE is consistent and attains the minimax-optimal convergence rate for Cox-NN, with the rate governed by intrinsic function complexity rather than ambient dimension, and shows that for Cox regression mb-MPLE is $\sqrt{n}$-consistent and asymptotically normal with a batch-size-dependent variance. The paper also provides actionable SGD guidance, demonstrating that the ratio of learning rate to batch size critically shapes training dynamics and that larger batches improve local convexity and efficiency. These findings are validated through simulations and an analysis of the large AREDS dataset with a ResNet-50 backbone. Together, the results give scalable deep-learning survival analysis a rigorous statistical foundation.
Abstract
The stochastic gradient descent (SGD) algorithm has been widely used to optimize deep Cox neural networks (Cox-NN) by updating model parameters using mini-batches of data. We show that SGD optimizes the average of the mini-batch partial likelihoods, which differs from the standard partial likelihood. This distinction requires establishing statistical properties for the global optimizer of that objective, namely the mini-batch maximum partial-likelihood estimator (mb-MPLE). We establish that mb-MPLE for Cox-NN is consistent and achieves the optimal minimax convergence rate up to a polylogarithmic factor. For Cox regression with linear covariate effects, we further show that mb-MPLE is $\sqrt{n}$-consistent and asymptotically normal, with asymptotic variance approaching the information lower bound as the batch size increases; simulation studies confirm this. Additionally, we offer practical guidance on using SGD, supported by theoretical analysis and numerical evidence. For Cox-NN, we demonstrate that the ratio of the learning rate to the batch size is critical in SGD dynamics, offering insight into hyperparameter tuning. For Cox regression, we characterize the iterative convergence of SGD, ensuring that the global optimizer, mb-MPLE, can be approximated with sufficiently many iterations. Finally, we demonstrate the effectiveness of mb-MPLE in a large-scale real-world application where the standard MPLE is intractable.
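
To make the distinction between the two objectives concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a Breslow-type partial likelihood with no tied event times and a loss normalized by the sample size. The function names (`neg_log_partial_likelihood`, `minibatch_objective`) and the simulated data are hypothetical. The mini-batch objective is approximated here by averaging over one random partition into batches, with each risk set restricted to the subjects inside the batch.

```python
import numpy as np

def neg_log_partial_likelihood(risk, time, event):
    """Negative log partial likelihood divided by the sample size.

    Assumes no tied event times; the risk set of subject i is {j : time_j >= time_i}.
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    order = np.argsort(-time)                 # descending time
    risk, event = risk[order], event[order]
    # Running log-sum-exp gives log sum_{j in risk set} exp(risk_j) for each i.
    log_risk_set = np.logaddexp.accumulate(risk)
    return -np.sum(event * (risk - log_risk_set)) / len(risk)

def minibatch_objective(risk, time, event, batch_size, rng):
    """Average of batch-level losses over one random partition (illustrative)."""
    risk, time, event = map(np.asarray, (risk, time, event))
    perm = rng.permutation(len(risk))
    batches = [perm[i:i + batch_size] for i in range(0, len(perm), batch_size)]
    losses = [neg_log_partial_likelihood(risk[b], time[b], event[b]) for b in batches]
    return float(np.mean(losses))

# Hypothetical simulated data: one covariate, exponential event times.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
risk = 0.5 * x                                # linear risk score, assumed coefficient
time = rng.exponential(scale=np.exp(-risk))
event = rng.binomial(1, 0.7, size=n)          # 1 = event observed, 0 = censored

full = neg_log_partial_likelihood(risk, time, event)
mb = minibatch_objective(risk, time, event, batch_size=20, rng=rng)
print(f"full-sample loss: {full:.4f}, mini-batch loss: {mb:.4f}")
```

Because each batch-level risk set is a subset of the full-sample risk set, the two losses differ at the same parameter value, and they coincide only when the batch contains the whole sample. This is why the minimizer of the mini-batch objective needs its own statistical analysis.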
