A Note on Asynchronous Challenges: Unveiling Formulaic Bias and Data Loss in the Hayashi-Yoshida Estimator
Evangelos Georgiadis
TL;DR
This work uncovers an intrinsic formulaic bias in the Hayashi-Yoshida (HY) estimator for asynchronous data, where telescoping in the summation can cause data points to cancel out and become nonextant. It introduces an $(a,b)$-asynchronous adversary based on independent Poisson inputs to model asynchronous observations and derives a closed-form expression for the expected proportion of nonextant data points, $f(a,b)=\left(\frac{a}{a+b}\right)^3+\left(\frac{b}{a+b}\right)^3$, with a minimum of $1/4$ (25%) at $a=b$. The paper provides algorithms to count nonextant data points, proves necessary and sufficient conditions for their occurrence, and validates the theory with Monte Carlo simulations comparing cumulative data loss to the theoretical benchmark. The results highlight fundamental limitations in using the HY estimator for fine-grained lead-lag analysis under asynchrony and offer a framework for assessing estimator efficiency and robustness in the presence of intrinsic bias.
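As a quick check of the stated minimum, the closed form can be rewritten in terms of a single variable. The symbol $p$ below is introduced here for illustration only and does not appear in the summary above; the steps are elementary algebra rather than the paper's own proof.

$$
p := \frac{a}{a+b}\in(0,1), \qquad f(a,b) = p^3 + (1-p)^3 = 1 - 3p(1-p).
$$

Since $p(1-p)\le \tfrac14$ with equality only at $p=\tfrac12$, it follows that

$$
f(a,b) \;\ge\; 1 - \tfrac34 \;=\; \tfrac14, \qquad \text{with equality if and only if } a=b.
$$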
Abstract
The Hayashi-Yoshida (HY) estimator exhibits an intrinsic telescoping property that leads to an often overlooked computational bias, which we call formulaic or intrinsic bias. This formulaic bias results in data loss by cancelling out potentially relevant data points, the nonextant data points. This paper attempts to formalize and quantify the data loss arising from this bias. In particular, we demonstrate the existence of nonextant data points via a concrete example, and prove necessary and sufficient conditions for the telescoping property to induce this type of formulaic bias. Since this type of bias is nonexistent when the inputs, i.e., the observation times $\Pi^{(1)} :=(t_i^{(1)})_{i=0,1,\ldots}$ and $\Pi^{(2)} :=(t_j^{(2)})_{j=0,1,\ldots}$, are synchronous, we introduce the $(a,b)$-asynchronous adversary. This adversary generates inputs $\Pi^{(1)}$ and $\Pi^{(2)}$ according to two independent homogeneous Poisson processes with rates $a>0$ and $b>0$, respectively. We address the foundational question of which rates $a$ and $b$ minimize the average cumulative data point loss. We prove that the minimal average cumulative data loss over both inputs is attained for equal rates $a=b$ and amounts to 25%. We present an algorithm, based on our theorem, for computing the exact number of nonextant data points given inputs $\Pi^{(1)}$ and $\Pi^{(2)}$, and suggest alternative methods. Finally, we use simulated data to empirically compare the (cumulative) average data loss of the HY estimator with the theoretical value.
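The kind of simulation described above can be sketched in a few lines of Python. This is a minimal illustration under a stated assumption, not the paper's algorithm: a data point is counted as nonextant when both of its neighbours in the merged time ordering come from the same input, which is the reading consistent with $f(a,b)=\left(\frac{a}{a+b}\right)^3+\left(\frac{b}{a+b}\right)^3$. The function names `simulate_nonextant_fraction` and `theoretical_fraction`, and the `horizon` parameter, are hypothetical.

```python
import random


def theoretical_fraction(a, b):
    """Closed form f(a, b) = (a/(a+b))^3 + (b/(a+b))^3 quoted in the summary."""
    p = a / (a + b)
    return p ** 3 + (1 - p) ** 3


def simulate_nonextant_fraction(a, b, horizon, seed=0):
    """Estimate the cumulative share of nonextant points on [0, horizon].

    ASSUMPTION (not taken from the paper's text): a point is nonextant when
    its predecessor and successor in the merged ordering belong to the same
    input, so that no observation of the other input falls between them.
    """
    rng = random.Random(seed)

    def poisson_times(rate):
        # Homogeneous Poisson process: i.i.d. exponential inter-arrival times.
        t, times = 0.0, []
        while True:
            t += rng.expovariate(rate)
            if t > horizon:
                return times
            times.append(t)

    # Merge both inputs and keep only the label (1 or 2) of each point.
    merged = sorted(
        [(t, 1) for t in poisson_times(a)] + [(t, 2) for t in poisson_times(b)]
    )
    labels = [label for _, label in merged]

    interior = range(1, len(labels) - 1)
    if len(interior) == 0:
        return 0.0
    nonextant = sum(
        1 for k in interior if labels[k - 1] == labels[k] == labels[k + 1]
    )
    return nonextant / len(interior)


if __name__ == "__main__":
    for a, b in [(1.0, 1.0), (1.0, 3.0)]:
        est = simulate_nonextant_fraction(a, b, horizon=50_000.0)
        print(f"a={a}, b={b}: simulated={est:.4f}, f(a,b)={theoretical_fraction(a, b):.4f}")
```

Boundary points (the first and last of the merged sequence) are excluded from the count, which has a negligible effect for a long horizon. The simulated share should approach $f(a,b)$ as `horizon` grows.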
