Moving object detection from multi-depth images with an attention-enhanced CNN

Masato Shibukawa, Fumi Yoshida, Toshifumi Yanagisawa, Takashi Ito, Hirohisa Kurosaki, Makoto Yoshikawa, Kohki Kamiya, Ji-an Jiang, Wesley Fraser, JJ Kavelaars, Susan Benecchi, Anne Verbiscer, Akira Hatakeyama, Hosei O, Naoya Ozaki

TL;DR

The paper tackles automated detection of faint moving objects in solar-system surveys by leveraging multi-depth stacked images as a 3D input to CNNs enhanced with Convolutional Block Attention Modules (CBAM). It systematically compares small and large CNN backbones, demonstrates that CBAM improves performance across architectures, and shows that multi-depth inputs (e.g., combining 32-frame and 4-frame stacks) with lightweight CNNs can outperform deeper networks in several metrics. Through extensive cross-validation and evaluation on unseen data, the work reports high accuracy (≈98.9%), F1 ≈99.3%, and AUC ≈0.99, while achieving up to 99.37% reduction in manual verification work with thresholding and ensemble decisions. The findings support deploying CBAM-enabled, multi-depth models in automated surveys (JAXA, NH, Subaru HSC) and have implications for scalable pipelines in upcoming large-scale observatories like LSST and SKA.

Abstract

One of the greatest challenges in detecting moving objects in the solar system from wide-field survey data is determining whether a signal indicates a true object or is due to some other source, such as noise. Object verification has relied heavily on human eyes, which usually results in significant labor costs. To address this limitation and reduce the reliance on manual intervention, we propose a multi-input convolutional neural network integrated with a convolutional block attention module. This method is specifically tailored to enhance the moving object detection system that we have developed and used previously. The current method introduces two innovations. The first is a multi-input architecture that processes multiple stacked images simultaneously. The second is the incorporation of the convolutional block attention module, which enables the model to focus on essential features in both the spatial and channel dimensions. These advancements facilitate efficient learning from multiple inputs, leading to more robust detection of moving objects. The performance of the model is evaluated on a dataset consisting of approximately 2,000 observational images. We achieved an accuracy of nearly 99% with an area under the curve (AUC) of >0.99. These metrics indicate that the proposed model achieves excellent classification performance. By adjusting the threshold for object detection, the new model reduces the human workload by more than 99% compared to manual verification.
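To make the closing claim concrete, the following is a minimal sketch of how a detection threshold can trade human workload against recall. It is not the authors' procedure: the function `review_workload`, the toy scores and labels, and the definition of workload as "the fraction of candidates still shown to a human" are all assumptions made for illustration, since this section does not define the metric.

```python
def review_workload(scores, labels, threshold):
    """Return (review_fraction, recall) for a single decision threshold.

    scores: model probabilities that each candidate is a real moving object.
    labels: 1 for a confirmed object, 0 for a false candidate (hypothetical).
    Assumption: candidates scoring below `threshold` are rejected without
    inspection, and everything at or above it is passed to a human.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one candidate is required")

    passed = [s >= threshold for s in scores]
    n_review = sum(passed)
    n_true = sum(labels)
    n_true_passed = sum(1 for p, y in zip(passed, labels) if p and y == 1)

    review_fraction = n_review / len(scores)
    # With no true objects in the sample, nothing can be missed.
    recall = n_true_passed / n_true if n_true else 1.0
    return review_fraction, recall


# Made-up scores and labels, purely illustrative (not data from the paper).
scores = [0.02, 0.10, 0.97, 0.40, 0.99, 0.05, 0.88, 0.01]
labels = [0, 0, 1, 0, 1, 0, 1, 0]

review_fraction, recall = review_workload(scores, labels, threshold=0.5)
reduction = 1.0 - review_fraction
print(f"workload reduction: {reduction:.1%}, recall: {recall:.1%}")
```

Raising the threshold shrinks the set sent to a human but risks rejecting faint true objects, so in practice the threshold would be chosen on held-out data with a recall floor in mind.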

Paper Structure

This paper contains 40 sections, 5 equations, 22 figures, 6 tables.

Figures (22)

  • Figure 1: An example of images generated by our moving object detection system. (a) Raw observation images. (b) 4-frame stacked images. (c) 8-frame stacked images. (d) 16-frame stacked images. (e) A 32-frame stacked image. Alt text: Figure with five panels showing astronomical cutouts. Panel a is a single noisy raw image. Panels b to e are stacks from 4, 8, 16, and 32 frames; noise decreases and faint sources become progressively visible.
  • Figure 2: Block diagram of input data construction: (1) output from the JAXA detection system, (2) intermediate stacked images generated from 4, 8, 16, and 32 frames, and (3) input tensor formed by aligning the 4-frame and 32-frame stacked images along the channel axis. Alt text: Three-step flow diagram. First, outputs from the JAXA detection system are taken as inputs. Second, stacks are produced at 4, 8, 16, and 32 frames. Third, the 4-frame and 32-frame stacks are combined along the channel axis to form the model input tensor.
  • Figure 3: Architectures of the small-scale CNNs with CBAM. Each CNN-CBAM layer consists of one convolutional layer, one pooling layer, and one CBAM module. The models contain 2 or 4 such layers. Alt text: Block diagrams of compact convolutional networks where each block includes convolution, pooling, and CBAM; variants stack two or four blocks before a classifier head producing a binary probability of object presence.
  • Figure 4: ResNet block architecture with CBAM. The attention module is inserted between residual blocks. Alt text: Schematic of a residual block with a shortcut connection; CBAM is placed between consecutive residual blocks to compute channel and spatial weights that are multiplied with the intermediate feature maps.
  • Figure 5: Convolutional Block Attention Module (CBAM) architecture. Alt text: Diagram of CBAM showing sequential channel attention and spatial attention applied to a feature map; each attention produces a weight map that is multiplied element-wise with the features to refine salient information.
  • ...and 17 more figures
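The captions for Figures 2 and 5 describe two mechanisms that can be sketched in a few lines of NumPy: forming the input tensor by placing the 4-frame and 32-frame stacks along the channel axis, and applying CBAM as channel attention followed by spatial attention. The sketch below is illustrative only. The function names, the mean stacking, the random frames, the random weights, the reduction ratio of 2, and the 7x7 spatial kernel are assumptions rather than details taken from the paper, and the real detection system may stack frames differently (for example with a median along an assumed motion vector).

```python
import numpy as np


def build_input(frames):
    """Turn aligned frames of shape (32, H, W) into a tensor of shape (H, W, 2)."""
    stack4 = frames[:4].mean(axis=0)   # shallow stack: noisier
    stack32 = frames.mean(axis=0)      # deep stack: lower noise
    return np.stack([stack4, stack32], axis=-1)  # join along the channel axis


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """Weight each channel of x (H, W, C) using a shared two-layer MLP."""
    avg_desc = x.mean(axis=(0, 1))     # (C,) average-pooled descriptor
    max_desc = x.max(axis=(0, 1))      # (C,) max-pooled descriptor

    def mlp(v):
        return np.maximum(v @ w1, 0.0) @ w2   # ReLU hidden layer

    weights = sigmoid(mlp(avg_desc) + mlp(max_desc))  # (C,) in (0, 1)
    return x * weights


def spatial_attention(x, kernel):
    """Weight each pixel of x (H, W, C) using a k x k filter over pooled maps."""
    pooled = np.stack([x.mean(axis=-1), x.max(axis=-1)], axis=-1)  # (H, W, 2)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(pooled, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
    h, w = x.shape[:2]
    response = np.zeros((h, w))
    # Cross-correlation, as in deep-learning "convolution" layers.
    for i in range(k):
        for j in range(k):
            response += (padded[i:i + h, j:j + w, :] * kernel[i, j, :]).sum(axis=-1)
    weights = sigmoid(response)        # (H, W) in (0, 1)
    return x * weights[..., None]


def cbam(x, w1, w2, kernel):
    """Sequential channel attention, then spatial attention."""
    return spatial_attention(channel_attention(x, w1, w2), kernel)


rng = np.random.default_rng(0)

# Hypothetical aligned cutouts: 32 frames of pure noise, 16 x 16 pixels each.
frames = rng.normal(size=(32, 16, 16))
x = build_input(frames)
print(x.shape)

# Hypothetical feature map after a convolution layer, with 8 channels.
features = rng.normal(size=(16, 16, 8))
w1 = rng.normal(size=(8, 4))           # reduction ratio 2 (assumed)
w2 = rng.normal(size=(4, 8))
kernel = rng.normal(size=(7, 7, 2))    # 7 x 7 spatial filter (assumed)
refined = cbam(features, w1, w2, kernel)
print(refined.shape)
```

Because both attention maps pass through a sigmoid, every weight lies strictly between 0 and 1, so CBAM rescales features without changing the shape of the feature map. That is why it can be inserted after any convolution or residual block, as the captions for Figures 3 and 4 describe.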