Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines

Xinyi Ying, Chao Xiao, Ruojing Li, Xu He, Boyang Li, Xu Cao, Zhaoxu Li, Yingqian Wang, Mingyuan Hu, Qingyu Xu, Zaiping Lin, Miao Li, Shilin Zhou, Wei An, Weidong Sheng, Li Liu

TL;DR

This work addresses the lack of benchmarks for visible-thermal (RGBT) small object detection (SOD) by introducing RGBT-Tiny, a large-scale, finely aligned RGBT dataset with 115 paired sequences, 93K frames, 1.2M annotations, and 7 categories across 8 scene types. It proposes Scale Adaptive Fitness (SAFit), a size-aware measure that blends IoU and the Normalized Wasserstein Distance (NWD) through a switch controlled by the ground-truth (GT) bounding-box area, so that evaluation stays robust for both very small and large objects; used as a training loss, SAFit also encourages size-aware optimization. The authors benchmark 32 detectors across visible, thermal, and RGBT paradigms and report that SAFit is effective for both evaluation and training, and that multimodal fusion helps in challenging conditions. The dataset and SAFit measure provide a foundation for work on RGBT image fusion, detection, and tracking, with future directions including temporal modeling and weakly supervised learning.
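The size-dependent blend of IoU and NWD can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the switch is a sigmoid of sqrt(GT area)/C - 1, that NWD takes the usual form exp(-W2/C) with each box modeled as a 2D Gaussian, and that both terms share a single constant C. The function names and the default C = 32 are hypothetical placeholders.

```python
import math

# Boxes are (cx, cy, w, h) tuples in pixels.

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nwd(a, b, c):
    """Normalized Wasserstein distance between boxes modeled as Gaussians."""
    w2 = math.sqrt(
        (a[0] - b[0]) ** 2
        + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2
        + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-w2 / c)

def safit(gt, pred, c=32.0):
    """Assumed form: sigmoid switch on GT size between IoU and NWD."""
    area = gt[2] * gt[3]
    # switch -> 1 for large GT boxes (IoU dominates), -> 0 for tiny ones (NWD dominates)
    switch = 1.0 / (1.0 + math.exp(-(math.sqrt(area) / c - 1.0)))
    return switch * iou(gt, pred) + (1.0 - switch) * nwd(gt, pred, c)

if __name__ == "__main__":
    tiny_gt, tiny_pred = (10, 10, 4, 4), (14, 10, 4, 4)            # 4 px shift, no overlap
    large_gt, large_pred = (200, 200, 256, 256), (204, 200, 256, 256)
    print("tiny :", iou(tiny_gt, tiny_pred), safit(tiny_gt, tiny_pred))
    print("large:", iou(large_gt, large_pred), safit(large_gt, large_pred))
```

Under these assumptions, a 4×4 box shifted by 4 pixels has zero IoU but still receives a non-zero SAFit score through the NWD term, while for a 256×256 box the score stays close to plain IoU.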

Abstract

Small object detection (SOD) has been a longstanding and challenging task for decades, and numerous datasets and algorithms have been developed. However, they mainly focus on either the visible or the thermal modality, while the visible-thermal (RGBT) bimodality is rarely explored. Although some RGBT datasets have been developed recently, their insufficient quantity, limited categories, misaligned images and large target sizes prevent them from providing an impartial benchmark for evaluating multi-category visible-thermal small object detection (RGBT SOD) algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant targets (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Notably, over 81% of targets are smaller than 16×16 pixels, and we provide paired bounding box annotations with tracking IDs to offer an extremely challenging benchmark with wide-ranging applications, such as RGBT fusion, detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large targets. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset and SAFit measure, extensive evaluations have been conducted, including 23 recent state-of-the-art algorithms that cover four different types (i.e., visible generic detection, visible SOD, thermal SOD and RGBT object detection). The project is available at https://github.com/XinyiYing/RGBT-Tiny.

Paper Structure (14 sections, 4 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Example frames of RGBT-Tiny dataset. Scenes (annotation / frame number) are shown on the top. Sequence-level attributes are shown at the bottom. Pink, green and yellow circles represent levels of light vision (i.e., H: high, M: medium, L: low, In: invisible), target size (i.e., Et: extremely tiny, T: tiny, S: small, M: medium, L: large) and annotation density (i.e., S: sparse, M: medium, D: dense).
  • Figure 2: (a1) Raw RGB image is aligned to (a2) thermal image to generate (a3) adjusted RGB image. (b) An illustration of disparity variations of dual lenses.
  • Figure 3: (a) Annotation numbers w.r.t. target categories in visible and thermal modalities. Numbers represent the proportion of each category in annotations. (b) Inner circle shows sequence numbers w.r.t. scene categories, and outer circle shows the light vision distribution of scenes. Numbers in the pie chart represent the number of sequences of each scene type. Numbers in the legend represent the proportion of each light vision in annotations.
  • Figure 4: (a) Average annotation number per frame (i.e., annotation density) of each sequence. Larger circle represents higher density, and different colors represent different scene types. ($x$,$y$,$z$) are the numbers of sequences w.r.t. density levels (i.e., sparse, medium, dense). (b) Size distribution of each target category. Lines with different colors represent different scale levels. Radius represents the annotation number, and the area under each color line represents the total annotation number of each scale level.
  • Figure 5: (a) An illustration of the pixel deviation between the center points of the GT bbox and the predicted bbox. (b) IoU-Deviation curves w.r.t. different sizes of bboxes. (c)-(d) SAFit-Deviation curves under different $C$ values. The abscissa represents the deviation in pixels. The ordinate represents the corresponding metric value. Note that, since the locations of bboxes can only change discretely, curves are presented as scatter diagrams.
  • ...and 5 more figures
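
The IoU-Deviation behavior described in the Figure 5 caption can be reproduced with a few lines of arithmetic. The sketch below is an illustration, not the authors' plotting code: it assumes square boxes of equal size and a purely horizontal center shift, and the helper name `iou_shifted` is hypothetical.

```python
def iou_shifted(size, dev):
    """IoU of two size x size boxes whose centers differ by dev pixels horizontally."""
    overlap = max(0, size - dev) * size   # overlap shrinks by one column per pixel of shift
    union = 2 * size * size - overlap
    return overlap / union

if __name__ == "__main__":
    # Deviations are integers because box locations change in whole pixels.
    for size in (4, 16, 64):
        curve = [round(iou_shifted(size, d), 3) for d in range(0, 5)]
        print(f"{size:>2}x{size:<2} IoU at 0..4 px deviation: {curve}")
```

For this setup the IoU reduces to (s - d)/(s + d) for a box side s and deviation d < s, so a 1-pixel shift lowers IoU to 0.6 for a 4×4 box but only to about 0.97 for a 64×64 box. This sensitivity of IoU to small deviations on tiny targets is what motivates a size-aware measure.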