HitoMi-Cam: A Shape-Agnostic Person Detection Method Using the Spectral Characteristics of Clothing

Shuji Ono

TL;DR

This work tackles the limitations of CNN-based person detectors, which rely on shape cues and inherit training-data biases, by proposing HitoMi-Cam, a shape-agnostic detector that uses the spectral signatures of clothing in four narrow bands. Implemented on a low-cost edge device (Raspberry Pi 5) with a 4-band multispectral camera, it runs in real time (23.2 fps) and performs strongly at presence detection in challenging scenarios, notably a simulated search and rescue (SAR) setting where CNNs underperform (AP up to 93.5% vs. 53.8% for the best CNN). A lightweight MLP is trained offline; at inference time it classifies each pixel to produce a clothing map, and post-processing turns that map into bounding boxes, with a confidence of 1.0 reported whenever clothing is detected. Overall, HitoMi-Cam complements conventional detectors by detecting clothing materials in unpredictable postures and environments, offering practical value for disaster rescue and edge-based surveillance where shape-based methods struggle. The work also points to the need for integration with CNNs and for further robustness improvements.

Abstract

While convolutional neural network (CNN)-based object detection is widely used, it exhibits a shape dependency that degrades performance for postures not included in the training data. Building upon the author's previous simulation study published in this journal, this study implements and evaluates the spectral-based approach on physical hardware to address this limitation. Specifically, this paper introduces HitoMi-Cam, a lightweight and shape-agnostic person detection method that uses the spectral reflectance properties of clothing. The author implemented the system on a resource-constrained edge device without a GPU to assess its practical viability. The results indicate that a processing speed of 23.2 frames per second (fps) at 253x190 pixels is achievable, suggesting that the method can be used for real-time applications. In a simulated search and rescue scenario where the performance of CNNs declines, HitoMi-Cam achieved an average precision (AP) of 93.5%, surpassing that of the compared CNN models (best AP of 53.8%). Throughout all evaluation scenarios, the occurrence of false positives remained minimal. This study positions the HitoMi-Cam method not as a replacement for CNN-based detectors but as a complementary tool under specific conditions. The results indicate that spectral-based person detection can be a viable option for real-time operation on edge devices in real-world environments where shapes are unpredictable, such as disaster rescue.
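To put the reported speed in perspective, the short calculation below converts 23.2 fps at 253x190 pixels into a per-frame time budget and a per-pixel classification rate. It is only arithmetic on the two figures quoted in the abstract. It assumes frames are processed one after another, and the variable names are illustrative; which pipeline stages the 23.2 fps figure covers is not stated in this summary.

```python
# Figures quoted in the abstract.
WIDTH, HEIGHT = 253, 190   # processed image size in pixels
FPS = 23.2                 # reported frames per second

# Every pixel is classified independently, so each frame needs this many
# classifier evaluations.
pixels_per_frame = WIDTH * HEIGHT

# Sustained pixel classifications per second at the reported frame rate.
pixels_per_second = pixels_per_frame * FPS

# Average time available per frame, assuming sequential processing
# (an assumption made for this sketch).
frame_budget_ms = 1000.0 / FPS

if __name__ == "__main__":
    print(f"{pixels_per_frame} pixels per frame")
    print(f"{pixels_per_second:,.0f} pixel classifications per second")
    print(f"{frame_budget_ms:.1f} ms per frame")
```

Under these assumptions, the device handles about 48 thousand pixel vectors per frame, roughly 1.1 million per second, within an average budget of about 43 ms per frame.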

Paper Structure

This paper contains 30 sections, 21 figures, 5 tables.

Figures (21)

  • Figure 1: Conceptual comparison of detection principles of HitoMi-Cam and CNN-based methods. (a) Conventional methods such as Convolutional Neural Networks (CNNs) depend on the "shape" patterns of people included in the training data. Therefore, they face the challenge of shape dependency, where performance degrades for postures not present in the training data (e.g., a fallen person). (b) HitoMi-Cam focuses on the physical spectral reflectance characteristics (spectral signatures) of clothing materials rather than the shape of the object. By classifying each pixel independently, it aims for shape-agnostic detection that does not depend on the target's posture or orientation.
  • Figure 2: Overall system architecture of HitoMi-Cam. The system consists of two tiers: (a) an offline learning phase and (b) an online inference phase. In (a), based on the methods and datasets established in the author's previous research [ono2025jimaging], a 4-band selection is made from hyperspectral data on a GPU-equipped PC, a multi-layer perceptron (MLP) model is trained, and an Open Neural Network Exchange (ONNX) format inference model is generated. In (b), the HitoMi-Cam prototype, equipped with the generated inference model, performs person detection on a Raspberry Pi for real-time input from a 4-band camera.
  • Figure 3: Physical configuration of the HitoMi-Cam system. The system consists of a single-board computer (Raspberry Pi 5) as the host computer and a commercially available compound-eye camera module (PiTOMBO; Asahi Electronics Laboratory Co., Ltd., Osaka, Japan). As shown, the camera module is equipped with four optical bandpass filters and can simultaneously acquire a multispectral image with four spectral bands (central wavelengths 457, 565, 645, and 735 nm) in a single shot.
  • Figure 4: Processing pipeline from 4-band multispectral image acquisition to clothing map generation in HitoMi-Cam. The process starts with the acquisition of a 4-band multispectral image. After the luminance vector of each pixel is extracted, it is input into a lightweight MLP. The MLP independently classifies each pixel as "clothing" or "non-clothing" to generate an initial clothing map. Subsequently, post-processing using OpenCV (noise removal and morphological operations) is applied to finalize the clothing regions and calculate the final bounding boxes.
  • Figure 5: Example of the entire processing sequence from 4-band image acquisition to final bounding box generation. (a) Raw 4-band images captured by the camera. (b) Color composite image synthesized for easy viewing. The band assignments are R=645, G=565, and B=457 nm. (c) Initial map of pixels classified as "clothing" by the MLP. (d) Map after noise removal. (e) Map after morphological operations are applied to define continuous regions. (f) Final bounding box calculated for the defined clothing region.
  • ...and 16 more figures
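
The pipeline described in the captions of Figures 4 and 5 (per-pixel MLP classification, noise removal, morphological operations, bounding boxes) can be sketched as follows. This is a minimal, dependency-free illustration, not the author's implementation: the actual system runs an ONNX model and uses OpenCV for post-processing, whereas here the MLP weights are hypothetical toy values, 3x3 opening and closing stand in for the unspecified noise-removal and morphological steps, and all function names are invented for the example.

```python
from collections import deque

# Hypothetical toy weights (NOT the paper's trained model): one hidden unit
# that fires when band 4 exceeds band 3 by more than 0.2.
W1, B1, W2, B2 = [[0.0, 0.0, -1.0, 1.0]], [0.0], [1.0], -0.2

def is_clothing(pixel):
    """One-hidden-layer ReLU MLP on a 4-band vector; True if the logit > 0."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pixel)) + b)
              for row, b in zip(W1, B1)]
    return sum(w * h for w, h in zip(W2, hidden)) + B2 > 0.0

def _morph(mask, erode):
    """3x3 erosion (erode=True) or dilation; off-image pixels count as False."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [0 <= j < h and 0 <= i < w and mask[j][i]
                    for j in (y - 1, y, y + 1) for i in (x - 1, x, x + 1)]
            out[y][x] = all(vals) if erode else any(vals)
    return out

def opening(mask):
    """Erosion then dilation: removes specks smaller than the 3x3 element."""
    return _morph(_morph(mask, True), False)

def closing(mask):
    """Dilation then erosion: bridges small gaps between nearby regions."""
    return _morph(_morph(mask, False), True)

def bounding_boxes(mask):
    """Boxes (x, y, width, height) of 4-connected regions, confidence 1.0."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            queue, seen[y][x] = deque([(x, y)]), True
            x0 = x1 = x
            y0 = y1 = y
            while queue:  # breadth-first flood fill of one region
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append({"box": (x0, y0, x1 - x0 + 1, y1 - y0 + 1),
                          "confidence": 1.0})
    return boxes

def classify(image):
    """Initial clothing map: every pixel is classified independently."""
    return [[is_clothing(p) for p in row] for row in image]

def detect(image):
    """Classify, remove noise, join regions, then compute boxes."""
    return bounding_boxes(closing(opening(classify(image))))

def demo_image(size=12):
    """Synthetic frame: two 'clothing' patches split by a 1-row gap, plus noise."""
    background, cloth = (0.5, 0.5, 0.5, 0.5), (0.2, 0.2, 0.2, 0.8)
    image = [[background] * size for _ in range(size)]
    for y in list(range(2, 5)) + list(range(6, 10)):
        for x in range(3, 7):
            image[y][x] = cloth
    image[1][10] = cloth  # isolated noise pixel
    return image

if __name__ == "__main__":
    print(len(bounding_boxes(classify(demo_image()))), "raw regions")
    print(detect(demo_image()))
```

On the synthetic frame, the raw clothing map contains three regions (two patches and one noise pixel). After opening and closing, `detect` returns a single box, `(3, 2, 4, 8)`, with confidence 1.0: the noise pixel is removed and the one-row gap is bridged. Because no step looks at shape, the same code applies to any posture. Its accuracy depends entirely on how well the per-pixel classifier separates clothing from background.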