Loss Functions for Detecting Outliers in Panel Data
Charles D. Coleman, Thomas Bryan
TL;DR
The paper addresses detecting outliers in panel data without relying on distributional assumptions by developing a loss-function framework. It builds separate treatments for nonnegative data and data with mixed signs, introducing unsigned and signed loss functions, time-invariant variants, and cross-set comparison forms, all guided by Lie symmetries and monotonicity constraints. Key contributions include explicit loss forms such as $L(F;B)=|F-B|^p B^q$ with $p=1$, $q>-1$, the time-invariant extension $L(F_{it};B_i,t)=|F_{it}-B_i|B_i^{tq+t-1}$, and the signed version $S(F,B)=(F-B)B^q$, as well as a mixed-sign construction with $\Sigma=|F|+|B|$ and $L(F,B)=|F-B|^p(|F|+|B|)^qB$. The framework is demonstrated via applications to preexisting outlier criteria, GIS-based comparisons, and variable-classified outliers, and it provides guidance on selecting parameters like $q$ (often around $-0.5$) and interpreting results without distributional assumptions. The work offers a practical, scalable tool for data quality and anomaly detection in panel data, with broad applicability to demographics, economics, and geographic information systems. Overall, it enables principled, model-free outlier detection across time and data sign regimes.
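As a rough illustration of how the unsigned and signed forms quoted above might be evaluated, the following minimal Python sketch computes $L(F;B)=|F-B|^p B^q$ and $S(F,B)=(F-B)B^q$ and ranks a few observations by loss. It assumes that $F$ is the value under test and $B$ is its positive base (reference) value. The function names, the default $q=-0.5$, and the sample data are hypothetical choices made for illustration and are not taken from the paper.

```python
def unsigned_loss(f, b, p=1.0, q=-0.5):
    """Unsigned loss L(F;B) = |F - B|^p * B^q (sketch; assumes B > 0)."""
    if b <= 0:
        raise ValueError("this sketch assumes a positive base value B")
    return abs(f - b) ** p * b ** q


def signed_loss(f, b, q=-0.5):
    """Signed loss S(F,B) = (F - B) * B^q; the sign gives the direction."""
    if b <= 0:
        raise ValueError("this sketch assumes a positive base value B")
    return (f - b) * b ** q


# Hypothetical panel units, each mapped to (base value B, tested value F).
pairs = {
    "A": (100.0, 110.0),
    "B": (10000.0, 10010.0),
    "C": (400.0, 300.0),
}

# Rank units from largest to smallest unsigned loss.
ranked = sorted(
    pairs,
    key=lambda unit: unsigned_loss(pairs[unit][1], pairs[unit][0]),
    reverse=True,
)

for unit in ranked:
    b, f = pairs[unit]
    print(unit, unsigned_loss(f, b), signed_loss(f, b))
```

With $q=-0.5$, the same absolute difference of 10 produces a larger loss for unit A (base 100) than for unit B (base 10000), which shows how the $B^q$ factor scales deviations by the size of the base. Unit C has a negative signed loss because its tested value lies below its base.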
Abstract
The detection of outliers is of critical importance in the assurance of data quality. Outliers may exist in observed data or in data derived from these observed data, such as estimates and forecasts. An outlier may indicate a problem with its data generation process or may simply be a true, but unusual, statement about the world. Without making any distributional assumptions, we propose the use of loss functions to detect these outliers in panel data. Part I covers nonnegative data. We axiomatically derive an unsigned loss function. We then develop a signed loss function to account for positive and negative outliers separately. In the case of nominal time we obtain an exact parametrization of the loss function. A time-invariant loss function permits the comparison of data at multiple times on the same basis. We provide several examples, including an example in which the outliers are classified by another variable. Part II covers data of mixed sign. As in Part I, we axiomatically develop unsigned and signed loss functions. We search for optimal values of the loss function parameter using graphs.
