XS-VID: An Extremely Small Video Object Detection Dataset
Jiahao Guo, Ziyang Xu, Lianjun Wu, Fei Gao, Wenyu Liu, Xinggang Wang
TL;DR
This work addresses the scarcity of robust Small Video Object Detection (SVOD) benchmarks for extremely small objects by introducing XS-VID, a high-resolution aerial dataset with $12{,}230$ images across $38$ videos and eight categories. Object sizes span $0\sim32^2$ in pixel area, divided into three ranges: $0\sim12^2$, $12^2\sim20^2$, and $20^2\sim32^2$. To detect such tiny objects, the authors propose YOLOFT, a YOLOv8-based SVOD detector equipped with a Multi-Scale Spatio-Temporal Flow (MSTF) module that fuses temporal motion cues with static features via a correlation pyramid and a GRU-based updater. Experiments on XS-VID and VisDrone2019VID show YOLOFT achieving state-of-the-art performance on extremely small objects, exposing the limitations of existing VOD and SOD approaches in SVOD settings and the value of incorporating motion information at multiple scales. XS-VID, its benchmarks, and YOLOFT are released as a challenging testbed to spur further research on tiny-object video detection. Overall, the paper contributes a dataset for extremely small objects in video and a practical, real-time method that uses temporal cues to improve robustness and accuracy on tiny targets.
Abstract
Small Video Object Detection (SVOD) is a crucial subfield in modern computer vision, essential for early object discovery and detection. However, existing SVOD datasets are scarce and suffer from insufficiently small objects, limited object categories, and a lack of scene diversity, which restricts the corresponding methods to narrow application scenarios. To address this gap, we develop the XS-VID dataset, which comprises aerial data from various periods and scenes and annotates eight major object categories. To further evaluate existing methods on extremely small objects, XS-VID extensively collects three types of objects with small pixel areas: extremely small (\textit{es}, $0\sim12^2$), relatively small (\textit{rs}, $12^2\sim20^2$), and generally small (\textit{gs}, $20^2\sim32^2$). XS-VID offers unprecedented breadth and depth in covering and quantifying minuscule objects, significantly enriching the scene and object diversity of the dataset. Extensive validation on XS-VID and the publicly available VisDrone2019VID dataset shows that existing methods struggle with small object detection and significantly underperform general object detectors. Leveraging the strengths of previous methods and addressing their weaknesses, we propose YOLOFT, which enhances local feature associations and integrates temporal motion features, significantly improving the accuracy and stability of SVOD. Our datasets and benchmarks are available at \url{https://gjhhust.github.io/XS-VID/}.
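As an illustration of the three size ranges defined above, the following is a minimal sketch of how a bounding box might be assigned to \textit{es}, \textit{rs}, or \textit{gs} by pixel area. The function name `size_bucket`, the constant names, and the treatment of boundary values (each upper bound is taken as inclusive) are assumptions made for this example; the abstract does not specify them, and this is not the dataset's official evaluation code.

```python
from typing import Optional

# Upper area bounds (in pixels) for the three ranges given in the abstract.
# Treating each bound as inclusive is an assumption of this sketch.
ES_MAX_AREA = 12 ** 2   # extremely small: 0 ~ 12^2
RS_MAX_AREA = 20 ** 2   # relatively small: 12^2 ~ 20^2
GS_MAX_AREA = 32 ** 2   # generally small: 20^2 ~ 32^2


def size_bucket(width: float, height: float) -> Optional[str]:
    """Return 'es', 'rs', or 'gs' for a box of the given size in pixels.

    Returns None when the box area exceeds 32^2, i.e. the box falls
    outside the three small-object ranges.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    area = width * height
    if area <= ES_MAX_AREA:
        return "es"
    if area <= RS_MAX_AREA:
        return "rs"
    if area <= GS_MAX_AREA:
        return "gs"
    return None


if __name__ == "__main__":
    # Hypothetical box sizes, chosen only to exercise each branch.
    for w, h in [(10, 10), (15, 15), (25, 25), (40, 40)]:
        print((w, h), size_bucket(w, h))
```

Bucketing by area rather than by side length means that an elongated box, for example $6\times20$ pixels, still counts as \textit{es} under this sketch.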
