Using Cross-Domain Detection Loss to Infer Multi-Scale Information for Improved Tiny Head Tracking

Jisu Kim, Alex Mattingly, Eung-Joo Lee, Benjamin S. Riggan

TL;DR

The paper tackles tiny head detection in crowded scenes under stringent resource constraints. It introduces a training-time framework that blends a cross-domain detection loss, a multi-scale training module, and a Small Receptive Field Detection (SRFD) block, while inference uses a compact backbone with channel alignment and tracking-by-detection. The cross-domain loss, defined as $L_{\mathrm{CDDL}} = \lambda_1 L_{\mathrm{CIoU\_DA}} + \lambda_2 L_{\mathrm{DFL\_DA}} + \lambda_3 L_{\mathrm{BCE\_DA}}$, fuses outputs from rich- and poor-quality detectors to improve robustness across domains; the multi-scale module enables inference-time efficiency by transferring training-time multi-scale information; SRFD enhances tiny-object representation within a four-level feature pyramid. Evaluations on CroHD and CrowdHuman show improved MOTA and mAP with lower FLOPs compared to strong baselines, demonstrating real-time applicability for tiny head tracking in crowded scenes on resource-constrained devices.
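As a rough illustration of how the three terms of the cross-domain detection loss combine, the weighted sum can be sketched as below. This is a minimal sketch, not the authors' implementation: the names `CDDLWeights` and `cddl_loss` are hypothetical, the default weights of 1.0 are placeholders rather than values from the paper, and the three component losses are assumed to be already computed as scalars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CDDLWeights:
    """Hypothetical container for the lambda weights; defaults are placeholders."""
    lambda_1: float = 1.0  # weight on the CIoU (box regression) term
    lambda_2: float = 1.0  # weight on the DFL (distribution focal) term
    lambda_3: float = 1.0  # weight on the BCE (classification) term


def cddl_loss(l_ciou_da: float, l_dfl_da: float, l_bce_da: float,
              weights: CDDLWeights = CDDLWeights()) -> float:
    """Weighted sum of the three precomputed domain-adaptation loss terms."""
    return (weights.lambda_1 * l_ciou_da
            + weights.lambda_2 * l_dfl_da
            + weights.lambda_3 * l_bce_da)


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the paper.
    print(cddl_loss(0.5, 0.2, 0.1, CDDLWeights(2.0, 1.0, 0.5)))
```

How each component term is computed from the outputs of the two detectors is not specified in this summary, so the sketch leaves those terms as inputs.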

Abstract

Head detection and tracking are essential for downstream tasks, but current methods often require large computational budgets, which increases latency and ties up resources (e.g., processors, memory, and bandwidth). To address this, we propose a framework to enhance tiny head detection and tracking by optimizing the balance between performance and efficiency. Our framework integrates (1) a cross-domain detection loss, (2) a multi-scale module, and (3) a small receptive field detection mechanism. These innovations enhance detection by bridging the gap between large and small detectors, capturing high-frequency details at multiple scales during training, and using filters with small receptive fields to detect tiny heads. Evaluations on the CroHD and CrowdHuman datasets show improved Multiple Object Tracking Accuracy (MOTA) and mean Average Precision (mAP), demonstrating the effectiveness of our approach in crowded scenes.
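For readers unfamiliar with the tracking metric, MOTA under the standard CLEAR MOT definition is one minus the ratio of total errors (misses, false positives, and identity switches) to total ground-truth objects, summed over all frames. The sketch below assumes that definition; the `FrameCounts` type and `mota` function are hypothetical names for illustration, and the numbers in the usage example are made up rather than taken from the paper.

```python
from typing import Iterable, NamedTuple


class FrameCounts(NamedTuple):
    """Per-frame counts used by the CLEAR MOT accuracy metric."""
    gt: int    # ground-truth objects in the frame
    fn: int    # missed objects (false negatives)
    fp: int    # false positives
    idsw: int  # identity switches


def mota(frames: Iterable[FrameCounts]) -> float:
    """MOTA = 1 - (sum FN + sum FP + sum IDSW) / sum GT over all frames."""
    frames = list(frames)
    total_gt = sum(f.gt for f in frames)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    total_errors = sum(f.fn + f.fp + f.idsw for f in frames)
    return 1.0 - total_errors / total_gt


if __name__ == "__main__":
    # Made-up counts for two frames: 5 errors over 20 ground-truth objects.
    example = [FrameCounts(gt=10, fn=1, fp=1, idsw=0),
               FrameCounts(gt=10, fn=2, fp=0, idsw=1)]
    print(mota(example))
```

Because errors are counted against the number of ground-truth objects, MOTA can be negative when a tracker produces more errors than there are objects.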

Paper Structure

This paper contains 17 sections, 9 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of head detection performance between the base models (top, middle) and our framework (bottom). The top model has high computational cost with good accuracy but lacks the small receptive field detection block, domain adaptation, and multi-scale processing. The middle model has lower computational cost but worse accuracy, and it also lacks these features. In contrast, our framework achieves better accuracy with lower computational cost by incorporating the small receptive field detection block, domain adaptation, and multi-scale processing.
  • Figure 2: Framework Overview: In the training pipeline (both light and dark blue arrows), the multi-scale approach is applied to strengthen detection and tracking across various object sizes, while training is conducted using a large backbone and detection head. Moreover, the cross-domain detection loss is utilized during training to help improve the discriminability of features extracted from a more compact backbone architecture (details provided in the Domain Adaptation subsection). In the inference pipeline (dark blue arrows), only the small/compact backbone is used to reduce computational cost, and detection is performed using the large head via the channel alignment layer.
  • Figure 3: Multi-scale module Overview: The input image is downsampled multiple times to process images at multiple scales: 640×640, 860×860, and 1080×1080. Each branch undergoes convolution operations, and the resulting feature maps are merged through a series of downsampling and concatenation steps. This ensures that features at various scales are effectively captured and integrated to enhance the detection of objects of different sizes.
  • Figure 4: Backbone architecture and head with the addition of a Small Receptive Field Detection (SRFD) block. The orange components are added to operate on early features from the base model (gray sections). The SRFD block enhances the detection of small objects, resulting in improved performance for tiny object detection.
  • Figure 5: Comparison of MOTA vs. FLOPs on the CroHD dataset for sequences HT21-01 to HT21-04, consisting of a total of 5741 frames. The YOLOv8 models are represented by blue stars, YOLOv8 + SRFD models by green squares, YOLOv8 + SRFD + CDDL models by purple crosses, and YOLOv8 + SRFD + CDDL + MS models by red diamonds. The results show the trade-off between MOTA and computational complexity (FLOPs) for different configurations.
  • ...and 1 more figure
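The first step described in the Figure 3 caption, producing copies of the input at several resolutions, can be sketched as follows. This is only an illustration under stated assumptions: nearest-neighbor resampling is an assumed choice (the summary does not name the interpolation method), `resize_nearest` and `build_pyramid` are hypothetical helper names, and the convolution and merge steps of the module are omitted. Images are plain nested lists so the example runs without third-party libraries.

```python
from typing import Dict, List, Sequence

Image = List[List[int]]  # a single-channel image as rows of pixel values


def resize_nearest(image: Image, out_h: int, out_w: int) -> Image:
    """Resample an image to out_h x out_w with nearest-neighbor indexing."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def build_pyramid(image: Image,
                  sizes: Sequence[int] = (640, 860, 1080)) -> Dict[int, Image]:
    """Return one square copy of the image per requested side length.

    The default sizes are the three resolutions listed in the Figure 3 caption.
    """
    return {s: resize_nearest(image, s, s) for s in sizes}


if __name__ == "__main__":
    # Tiny 4x4 test image with small target sizes to keep the demo readable.
    tiny = [[r * 4 + c for c in range(4)] for r in range(4)]
    for side, scaled in build_pyramid(tiny, sizes=(2, 3)).items():
        print(side, scaled)
```

In a real pipeline each resolution would feed its own convolutional branch before the feature maps are merged; that part depends on architectural details not given here.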