XS-VID: An Extremely Small Video Object Detection Dataset

Jiahao Guo; Ziyang Xu; Lianjun Wu; Fei Gao; Wenyu Liu; Xinggang Wang

TL;DR

This work addresses the scarcity of robust SVOD benchmarks for extremely small objects by introducing XS-VID, a high-resolution aerial dataset with $12{,}230$ images across $38$ videos and eight categories. Object pixel areas span $0\sim32^2$ and are divided into three ranges: $0\sim12^2$, $12^2\sim20^2$, and $20^2\sim32^2$. To handle such tiny objects, the authors propose YOLOFT, a YOLOv8-based SVOD detector equipped with a Multi-Scale Spatio-Temporal Flow (MSTF) module that fuses temporal motion cues with static features via a correlation pyramid and a GRU-based updater. Experiments on XS-VID and VisDrone2019VID show YOLOFT achieving state-of-the-art performance on extremely small objects, exposing the limitations of existing VOD/SOD approaches in SVOD settings and the value of incorporating motion information at multiple scales. XS-VID, together with its benchmarks and YOLOFT, provides a challenging new testbed for tiny-object video detection and is released to spur further SVOD research. Overall, the paper advances dataset construction for extremely small object detection in videos and delivers a practical, real-time method that uses temporal cues to improve robustness and accuracy on tiny targets.
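To make the term "correlation pyramid" concrete, the following is a minimal NumPy sketch of one common construction, in the style popularized by optical-flow estimators such as RAFT: an all-pairs correlation volume between two feature maps, average-pooled over its last two dimensions to obtain coarser levels. The summary above does not specify how YOLOFT builds its pyramid, so the function names (`correlation_volume`, `avg_pool2`, `correlation_pyramid`), the scaling by $\sqrt{C}$, and the number of levels are all illustrative assumptions, and the GRU-based updater is not modeled here.

```python
import numpy as np


def correlation_volume(feat_a, feat_b):
    """All-pairs dot-product correlation between two (C, H, W) feature maps.

    Returns an (H, W, H, W) array where entry [i, j, k, l] compares the
    feature at (i, j) in feat_a with the feature at (k, l) in feat_b.
    The 1/sqrt(C) scaling is an assumption borrowed from RAFT-style models.
    """
    c, h, w = feat_a.shape
    a = feat_a.reshape(c, h * w)
    b = feat_b.reshape(c, h * w)
    corr = (a.T @ b) / np.sqrt(c)
    return corr.reshape(h, w, h, w)


def avg_pool2(x):
    """2x2 average pooling over the last two axes (both must be even)."""
    *lead, h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("last two dimensions must be even")
    return x.reshape(*lead, h // 2, 2, w // 2, 2).mean(axis=(-3, -1))


def correlation_pyramid(feat_a, feat_b, levels=3):
    """Hypothetical multi-scale pyramid: pool the target dims repeatedly."""
    corr = correlation_volume(feat_a, feat_b)
    pyramid = [corr]
    for _ in range(levels - 1):
        corr = avg_pool2(corr)
        pyramid.append(corr)
    return pyramid
```

Each level keeps full resolution for the query positions while coarsening the positions being matched, so a lookup at a coarse level covers a larger displacement at the same cost. Whether YOLOFT follows this exact layout is not stated in this summary.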

Abstract

Small Video Object Detection (SVOD) is a crucial subfield in modern computer vision, essential for early object discovery and detection. However, existing SVOD datasets are scarce and suffer from issues such as insufficiently small objects, limited object categories, and lack of scene diversity, leading to unitary application scenarios for corresponding methods. To address this gap, we develop the XS-VID dataset, which comprises aerial data from various periods and scenes, and annotates eight major object categories. To further evaluate existing methods for detecting extremely small objects, XS-VID extensively collects three types of objects with smaller pixel areas: extremely small (\textit{es}, $0\sim12^2$), relatively small (\textit{rs}, $12^2\sim20^2$), and generally small (\textit{gs}, $20^2\sim32^2$). XS-VID offers unprecedented breadth and depth in covering and quantifying minuscule objects, significantly enriching the scene and object diversity in the dataset. Extensive validations on XS-VID and the publicly available VisDrone2019VID dataset show that existing methods struggle with small object detection and significantly underperform compared to general object detectors. Leveraging the strengths of previous methods and addressing their weaknesses, we propose YOLOFT, which enhances local feature associations and integrates temporal motion features, significantly improving the accuracy and stability of SVOD. Our datasets and benchmarks are available at \url{https://gjhhust.github.io/XS-VID/}.
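The three size ranges named in the abstract can be expressed as a small helper that maps a bounding box to its bucket. This is only a sketch: the function name `size_bucket` is hypothetical, and the treatment of boundary values (half-open intervals, so an area of exactly $12^2$ counts as \textit{rs}) is an assumption, since the abstract gives the ranges without stating which endpoints are inclusive.

```python
def size_bucket(width, height):
    """Map a box's pixel area to 'es', 'rs', or 'gs'; None if 32^2 or larger.

    Thresholds follow the ranges in the abstract; half-open boundaries
    [0, 12^2), [12^2, 20^2), [20^2, 32^2) are an assumption.
    """
    area = width * height
    if area <= 0:
        raise ValueError("width and height must be positive")
    if area < 12 ** 2:
        return "es"  # extremely small
    if area < 20 ** 2:
        return "rs"  # relatively small
    if area < 32 ** 2:
        return "gs"  # generally small
    return None  # outside the small-object ranges listed in the abstract
```

For example, a 10x10 box (area 100) falls in \textit{es}, while a 25x25 box (area 625) falls in \textit{gs}.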

Paper Structure (19 sections, 2 equations, 10 figures, 14 tables)

Introduction
Related Work
The XS-VID Dataset
Data Collection and Annotation
Statistical Analysis
YOLOFT
Experiments
Comparison with State-of-the-art Detection Methods
Design Considerations for Small Video Object Detection
Conclusion
XS-VID Data Details
Data Splits
XS-VID Object Distribution
XS-VID Movement Attribute
Ablation Studies on YOLOFT
...and 4 more sections

Figures (10)

Figure 1: Showcases of our XS-VID dataset's object size and challenges in SVOD. (a) shows that the objects in our XS-VID dataset are extremely small, and (b) indicates that SVOD mainly faces three challenges: background confusion, misclassification, and texture distortion.
Figure 2: Comparison of object size distribution between XS-VID and other datasets. Our XS-VID generally has smaller object sizes.
Figure 3: AP-Latency comparison of various methods on our XS-VID. Our YOLOFT achieves the SOTA performance.
Figure 4: Quantitative comparison between XS-VID and various datasets. The statistical results demonstrate that XS-VID has the highest number of extremely small objects and the widest area distribution, with a rich and balanced number of objects per frame. $\star$ in (b) specifically indicates a high-density aggregation of a particular object.
Figure 5: Overall architecture of our YOLOFT. Multi-Scale Spatio-Temporal Flow (MSTF) module maintains the optical flow information between consecutive frames and iteratively updates it. Based on this, it extracts multi-scale motion features of the object and integrates them into the static features of the current frame.
...and 5 more figures
