Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis

Masahiro Yasuda, Noboru Harada, Yasunori Ohishi, Shoichiro Saito, Akira Nakayama, Nobutaka Ono

TL;DR

The paper tackles distributed multimedia sensor event analysis (DiMSEA), where events are inferred from fragmented observations across distributed cameras and microphones under weak labels. It introduces Guided-MELD, a masked self-distillation framework that trains an encoder to produce a joint embedding consistent across sensor subsets while preserving sufficient event information, guided by downstream task performance. The approach combines sensor masking, a mean-teacher embedding, and MultiTrans-based fusion, optimizing both a downstream event analysis loss and a distillation loss to suppress background information. Empirical results on MM-Store and MM-Office show Guided-MELD outperforms CRF, MultiTrans, and MSM baselines, with strong robustness to sensor reduction, highlighting practical potential for scalable, resilient distributed sensor systems.
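The interplay of sensor masking, a mean-teacher embedding, and the two losses can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy mean-pooling encoder (standing in for MultiTrans-based fusion), the mean-squared-error distillation term, the choice to give the teacher all sensors, and every name and hyperparameter (`mask_sensors`, `encode`, `ema_update`, `guided_loss`, `keep_prob`, `lam`, `decay`) are assumptions made for illustration only.

```python
import numpy as np


def mask_sensors(x, keep_prob, rng):
    """Randomly drop whole sensors. x has shape (num_sensors, feat_dim)."""
    keep = rng.random(x.shape[0]) < keep_prob
    if not keep.any():
        # Always keep at least one sensor so the embedding is defined.
        keep[rng.integers(x.shape[0])] = True
    return x * keep[:, None], keep


def encode(x, keep, w):
    """Toy encoder (assumption): mean-pool kept sensors, linear map, tanh."""
    pooled = x[keep].mean(axis=0)
    return np.tanh(pooled @ w)


def ema_update(teacher_w, student_w, decay=0.99):
    """Mean-teacher update: teacher weights track an EMA of the student."""
    return decay * teacher_w + (1.0 - decay) * student_w


def bce(p, y, eps=1e-7):
    """Binary cross-entropy averaged over event classes (multi-label)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())


def guided_loss(x, y, student_w, teacher_w, clf_w, rng, keep_prob=0.7, lam=1.0):
    """Downstream task loss plus a weighted distillation loss (hypothetical form)."""
    masked, keep = mask_sensors(x, keep_prob, rng)
    z_student = encode(masked, keep, student_w)
    # Assumption: the teacher sees every sensor and is treated as a constant.
    z_teacher = encode(x, np.ones(x.shape[0], dtype=bool), teacher_w)
    p = 1.0 / (1.0 + np.exp(-(z_student @ clf_w)))  # per-class event probability
    task = bce(p, y)
    distill = float(((z_student - z_teacher) ** 2).mean())
    return task + lam * distill, task, distill


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))          # 5 sensors, 8-dim features each
    y = np.array([1.0, 0.0, 1.0])        # weak clip-level labels for 3 events
    student_w = rng.normal(size=(8, 4))
    teacher_w = student_w.copy()
    clf_w = rng.normal(size=(4, 3))
    total, task, distill = guided_loss(x, y, student_w, teacher_w, clf_w, rng)
    print(total, task, distill)
    teacher_w = ema_update(teacher_w, student_w)
```

In this sketch the distillation term pulls the embedding of a masked sensor subset toward the full-sensor teacher embedding, while the task term keeps the embedding informative about the events; the gradient step that would update `student_w` is omitted.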

Abstract

Observations from distributed sensors are essential for analyzing a series of human and machine activities (referred to as 'events' in this paper) in complex and extensive real-world environments. This is because the information obtained from a single sensor is often missing or fragmented in such an environment; observations from multiple locations and modalities should be integrated to analyze events comprehensively. However, a learning method for extracting joint representations that effectively combine such distributed observations has yet to be established. Therefore, we propose Guided Masked sELf-Distillation modeling (Guided-MELD) for inter-sensor relationship modeling. The basic idea of Guided-MELD is to learn to supplement the information from the masked sensor with the information from other sensors that is needed to detect the event. Guided-MELD is expected to enable the system to effectively distill the fragmented or redundant target event information obtained by the sensors without being overly dependent on any specific sensor. To validate the effectiveness of the proposed method in novel tasks of distributed multimedia sensor event analysis, we recorded two new datasets that fit the problem setting: MM-Store and MM-Office. These datasets consist of human activities in a convenience store and an office, recorded using distributed cameras and microphones. Experimental results on these datasets show that the proposed Guided-MELD improves event tagging and detection performance and outperforms conventional inter-sensor relationship modeling methods. Furthermore, the proposed method performed robustly even when the number of sensors was reduced.

Paper Structure (27 sections, 20 equations, 7 figures, 5 tables)


Figures (7)

  • Figure 1: An example of the observation for the distributed multimedia sensor event analysis task in our dataset: MM-Store. This example shows a scene in which a store staff member disinfects a store with an alcohol spray while a customer enters through the entrance. Camera 3 clearly captures the clerk, but his hand is difficult to see. On the other hand, microphone 4 clearly captures the sound of the alcohol spray being sprayed. The distributed multimedia sensor event analysis task requires distilling useful information for event identification from such observations. This figure also shows the room and sensor setup for the MM-Store dataset (details are given in Section V-A).
  • Figure 2: Variations of sensor placement for multi-sensor-based tasks.
  • Figure 3: A schematic view of a desirable encoder for DiMSEA. In this example, the entire scene is observed by five distributed sensors. The time index is omitted here for simplicity.
  • Figure 4: The architecture of Guided-MELD for distributed multi-modal event detection tasks. The $N$-channel camera and $M$-channel microphone input signals are divided into $T$ frames ($\tau$ is the time-frame index), and the event classifier output $\hat{\bm{p}}_{\tau}$ is computed in parallel. See the implementation section of the paper for details of each block, and the input-shaping figure for more details on video and audio input shaping.
  • Figure 5: Room and sensor setup of the MM-Office dataset.
  • ...and 2 more figures