Moving object detection from multi-depth images with an attention-enhanced CNN
Masato Shibukawa, Fumi Yoshida, Toshifumi Yanagisawa, Takashi Ito, Hirohisa Kurosaki, Makoto Yoshikawa, Kohki Kamiya, Ji-an Jiang, Wesley Fraser, JJ Kavelaars, Susan Benecchi, Anne Verbiscer, Akira Hatakeyama, Hosei O, Naoya Ozaki
TL;DR
The paper tackles automated detection of faint moving objects in solar-system surveys by leveraging multi-depth stacked images as a 3D input to CNNs enhanced with Convolutional Block Attention Modules (CBAM). It systematically compares small and large CNN backbones, demonstrates that CBAM improves performance across architectures, and shows that multi-depth inputs (e.g., combining 32-frame and 4-frame stacks) with lightweight CNNs can outperform deeper networks in several metrics. Through extensive cross-validation and evaluation on unseen data, the work reports high accuracy (≈98.9%), F1 ≈99.3%, and AUC ≈0.99, while achieving up to 99.37% reduction in manual verification work with thresholding and ensemble decisions. The findings support deploying CBAM-enabled, multi-depth models in automated surveys (JAXA, NH, Subaru HSC) and have implications for scalable pipelines in upcoming large-scale observatories like LSST and SKA.
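The workload-reduction figure can be read as the fraction of candidates that are rejected automatically and never reach a human inspector. The sketch below illustrates that arithmetic under stated assumptions: the scores are made up, the names `ensemble_score` and `workload_reduction` are hypothetical, and averaging the per-model probabilities (soft voting) is only one possible reading of "ensemble decisions". The paper's actual decision rule is not given in this summary.

```python
def ensemble_score(model_scores):
    """Average the per-model probabilities for one candidate.

    Assumption: simple soft voting; the source does not specify the rule.
    """
    return sum(model_scores) / len(model_scores)


def workload_reduction(scores, threshold):
    """Fraction of candidates that no longer need manual inspection.

    Assumption: candidates scoring below the threshold are rejected
    automatically, and only the remainder is passed to a human.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    kept = sum(1 for s in scores if s >= threshold)
    return 1.0 - kept / len(scores)


# Hypothetical probabilities from three models for five candidates.
per_model = [
    [0.98, 0.95, 0.99],  # likely a real moving object
    [0.10, 0.05, 0.20],
    [0.02, 0.01, 0.03],
    [0.40, 0.35, 0.30],
    [0.01, 0.02, 0.01],
]
combined = [ensemble_score(s) for s in per_model]
reduction = workload_reduction(combined, threshold=0.5)
print(f"{reduction:.0%} of candidates need no manual check")
```

Raising the threshold increases the reduction but also raises the risk of discarding true objects, so in practice the threshold would be chosen against a recall requirement as well.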
Abstract
One of the greatest challenges in detecting moving solar system objects in wide-field survey data is determining whether a signal indicates a true object or arises from some other source, such as noise. Object verification has relied heavily on visual inspection by humans, which incurs significant labor costs. To address this limitation and reduce the reliance on manual intervention, we propose a multi-input convolutional neural network integrated with a convolutional block attention module. This method is specifically tailored to enhance the moving object detection system that we have developed and used previously. The current method introduces two innovations. The first is a multi-input architecture that processes multiple stacked images simultaneously. The second is the incorporation of the convolutional block attention module, which enables the model to focus on essential features in both the spatial and channel dimensions. These advancements facilitate efficient learning from multiple inputs, leading to more robust detection of moving objects. The performance of the model is evaluated on a dataset consisting of approximately 2,000 observational images. We achieved an accuracy of nearly 99% with an area under the curve (AUC) of >0.99. These metrics indicate that the proposed model achieves excellent classification performance. By adjusting the threshold for object detection, the new model reduces the human workload by more than 99% compared to manual verification.
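To make the channel half of the attention mechanism concrete, here is a minimal pure-Python sketch of channel attention in the style of the standard convolutional block attention module: each channel is summarized by average and max pooling, both summaries pass through the same small perceptron, and the summed outputs go through a sigmoid to give one weight per channel. The weights `w1` and `w2`, the 2x2 feature maps, and the reading of the two channels as a deep and a shallow stack are illustrative assumptions, not values or design details from the paper. The spatial half of the module, which applies a convolution over channel-pooled maps, is omitted for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w1, w2):
    """Two-layer perceptron with ReLU; the same weights serve both pooled vectors."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feature_map, w1, w2):
    """Return one weight in (0, 1) per channel.

    feature_map: list of channels, each a 2D list of pixel values.
    """
    flat = [[p for row in ch for p in row] for ch in feature_map]
    avg_pooled = [sum(ch) / len(ch) for ch in flat]
    max_pooled = [max(ch) for ch in flat]
    avg_out = shared_mlp(avg_pooled, w1, w2)
    max_out = shared_mlp(max_pooled, w1, w2)
    return [sigmoid(a + m) for a, m in zip(avg_out, max_out)]


def apply_channel_attention(feature_map, weights):
    """Scale every pixel of each channel by that channel's weight."""
    return [[[p * w for p in row] for row in ch]
            for ch, w in zip(feature_map, weights)]


# Two hypothetical 2x2 channels (assumed here: a deep stack and a shallow stack).
features = [
    [[0.0, 1.0], [2.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
w1 = [[1.0, -1.0]]    # illustrative weights: 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]  # illustrative weights: 1 hidden unit -> 2 channels
weights = channel_attention(features, w1, w2)
refined = apply_channel_attention(features, weights)
print([round(w, 3) for w in weights])
```

With these toy weights the first channel is emphasized and the second is suppressed. In a trained network the perceptron weights are learned, so the model itself decides which input depth carries the more useful signal for a given candidate.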
