Statistical Guarantees for Offline Domain Randomization
Arnaud Fickinger, Abderrahim Bendahi, Stuart Russell
TL;DR
This paper addresses how to use offline real-world data to guide domain randomization for robust sim-to-real transfer in RL. It casts Offline Domain Randomization (ODR) as maximum-likelihood estimation over a parametric simulator family and proves weak consistency (convergence in probability) of the ODR estimator under mild regularity, positivity, and identifiability assumptions, upgrading to strong consistency (almost sure convergence) when a uniform Lipschitz condition holds. The authors discuss practicality, relaxations (e.g., stationarity/ergodicity, tail-based positivity), and provide a model-agnostic notion of informativeness, showing that data-informed ODR concentrates the learned distribution near the true dynamics. Collectively, these results place ODR on a principled footing, clarifying when offline data can safely guide the randomization distribution for downstream offline RL and improving data efficiency in sim-to-real pipelines.
Abstract
Reinforcement-learning (RL) agents often struggle when transferred from simulation to the real world. A dominant strategy for reducing the sim-to-real gap is domain randomization (DR), which trains the policy across many simulators produced by sampling dynamics parameters, but standard DR ignores offline data already available from the real system. We study offline domain randomization (ODR), which first fits a distribution over simulator parameters to an offline dataset. While a growing body of empirical work reports substantial gains with algorithms such as DROPO, the theoretical foundations of ODR remain largely unexplored. In this work, we cast ODR as maximum-likelihood estimation over a parametric simulator family and provide statistical guarantees: under mild regularity and identifiability conditions, the estimator is weakly consistent (i.e., it converges in probability to the true dynamics as the dataset grows), and it becomes strongly consistent (i.e., it converges almost surely to the true dynamics) when an additional uniform Lipschitz continuity assumption holds. We examine the practicality of these assumptions and outline relaxations that justify ODR's applicability across a broader range of settings. Taken together, our results place ODR on a principled footing and clarify when offline data can soundly guide the choice of a randomization distribution for downstream offline RL.
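To make the maximum-likelihood view concrete, the following is a minimal toy sketch and not the paper's estimator or the DROPO algorithm. It assumes a hypothetical one-dimensional simulator s' = s + xi * a + noise, a Gaussian randomization distribution N(mu, tau^2) over the single parameter xi, a known noise standard deviation, and a plain grid search in place of a real optimizer. All function names and numeric values are illustrative assumptions. Under these assumptions the marginal likelihood of s' given (s, a) is Gaussian with mean s + mu * a and variance tau^2 * a^2 + noise_std^2, so the fit has a closed-form objective.

```python
import math
import random


def simulate_dataset(n, xi_star=1.5, noise_std=0.1, seed=0):
    """Draw n offline transitions (s, a, s_next) from the toy 'real' system.

    Assumed dynamics (illustrative only): s_next = s + xi_star * a + noise.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = rng.uniform(-1.0, 1.0)
        s_next = s + xi_star * a + rng.gauss(0.0, noise_std)
        data.append((s, a, s_next))
    return data


def log_likelihood(data, mu, tau, noise_std=0.1):
    """Log-likelihood of the data when xi ~ N(mu, tau^2) is marginalized out.

    For this linear-Gaussian toy model the marginal of s_next is Gaussian
    with mean s + mu * a and variance (tau * a)^2 + noise_std^2.
    """
    total = 0.0
    for s, a, s_next in data:
        var = (tau * a) ** 2 + noise_std ** 2
        resid = s_next - (s + mu * a)
        total += -0.5 * (math.log(2.0 * math.pi * var) + resid ** 2 / var)
    return total


def fit_odr(data, mu_grid, tau_grid, noise_std=0.1):
    """Return the (mu, tau) pair on the grid that maximizes the likelihood."""
    best = max(
        (log_likelihood(data, mu, tau, noise_std), mu, tau)
        for mu in mu_grid
        for tau in tau_grid
    )
    return best[1], best[2]


if __name__ == "__main__":
    mu_grid = [0.05 * i for i in range(61)]   # candidate means in [0, 3]
    tau_grid = [0.1 * i for i in range(11)]   # candidate spreads in [0, 1]
    for n in (50, 500):
        mu_hat, tau_hat = fit_odr(simulate_dataset(n), mu_grid, tau_grid)
        print(f"n={n}: mu_hat={mu_hat:.2f}, tau_hat={tau_hat:.2f}")
```

In this toy setting the fitted mean should approach the assumed true parameter and the fitted spread should shrink as the dataset grows, which is the qualitative behavior the consistency results describe. The sketch does not check any of the paper's formal assumptions.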
