Using Cross-Domain Detection Loss to Infer Multi-Scale Information for Improved Tiny Head Tracking
Jisu Kim, Alex Mattingly, Eung-Joo Lee, Benjamin S. Riggan
TL;DR
The paper tackles tiny head detection in crowded scenes under stringent resource constraints. It introduces a training-time framework that blends a cross-domain detection loss, a multi-scale training module, and a Small Receptive Field Detection (SRFD) block, while inference uses a compact backbone with channel alignment and tracking-by-detection. The cross-domain loss, defined as $L_{\mathrm{CDDL}} = \lambda_1 L_{\mathrm{CIoU\_DA}} + \lambda_2 L_{\mathrm{DFL\_DA}} + \lambda_3 L_{\mathrm{BCE\_DA}}$, fuses outputs from rich- and poor-quality detectors to improve robustness across domains; the multi-scale module enables inference-time efficiency by transferring training-time multi-scale information; SRFD enhances tiny-object representation within a four-level feature pyramid. Evaluations on CroHD and CrowdHuman show improved MOTA and mAP with lower FLOPs compared to strong baselines, demonstrating real-time applicability for tiny head tracking in crowded scenes on resource-constrained devices.
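As a rough illustration of how the cross-domain detection loss combines its three terms, the following minimal sketch computes the weighted sum from scalar loss values that are assumed to be already available. The names `CDDLWeights` and `cross_domain_detection_loss` are hypothetical, and the default weights of 1.0 are placeholders rather than values reported by the authors; how each domain-aligned term is computed is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CDDLWeights:
    """Hypothetical container for lambda_1..lambda_3 (placeholder defaults)."""
    lambda_ciou: float = 1.0  # lambda_1: weight on the CIoU (box regression) term
    lambda_dfl: float = 1.0   # lambda_2: weight on the DFL (distribution focal) term
    lambda_bce: float = 1.0   # lambda_3: weight on the BCE (classification) term


def cross_domain_detection_loss(
    l_ciou_da: float,
    l_dfl_da: float,
    l_bce_da: float,
    weights: CDDLWeights = CDDLWeights(),
) -> float:
    """Weighted sum of the three domain-aligned loss terms.

    Each argument is assumed to be a scalar that has already been computed
    elsewhere; this function only performs the final combination.
    """
    return (
        weights.lambda_ciou * l_ciou_da
        + weights.lambda_dfl * l_dfl_da
        + weights.lambda_bce * l_bce_da
    )


if __name__ == "__main__":
    # Illustrative values only; these are not results from the paper.
    example_weights = CDDLWeights(lambda_ciou=2.0, lambda_dfl=1.0, lambda_bce=0.5)
    print(cross_domain_detection_loss(0.5, 0.2, 0.1, example_weights))
```

In a real training loop the three inputs would be tensors produced by the detector heads, and the same weighted sum would apply to them unchanged.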
Abstract
Head detection and tracking are essential for downstream tasks, but current methods often require large computational budgets, which increases latency and ties up resources (e.g., processors, memory, and bandwidth). To address this, we propose a framework to enhance tiny head detection and tracking by optimizing the balance between performance and efficiency. Our framework integrates (1) a cross-domain detection loss, (2) a multi-scale module, and (3) a small receptive field detection mechanism. These innovations enhance detection by bridging the gap between large and small detectors, capturing high-frequency details at multiple scales during training, and using filters with small receptive fields to detect tiny heads. Evaluations on the CroHD and CrowdHuman datasets show improved Multiple Object Tracking Accuracy (MOTA) and mean Average Precision (mAP), demonstrating the effectiveness of our approach in crowded scenes.
