Count What You Want: Exemplar Identification and Few-shot Counting of Human Actions in the Wild

Yifeng Huang, Duc Duy Nguyen, Lam Nguyen, Cuong Pham, Minh Hoai

TL;DR

This work tackles counting specific human actions in wearable sensor data when the action class is not fixed in advance. It introduces an exemplar-based framework in which users vocalize predefined counts ("one", "two", "three") to mark exemplars of the action, then applies a multi-stage pipeline (exemplar extraction, per-window embeddings, exemplar similarity, exemplar-infused embedding, and density estimation) to produce a moment-by-moment density map whose sum yields the final count. Key contributions include a constrained exemplar extraction mechanism based on dynamic programming, a distance-preserving loss that maintains embedding geometry, an exemplar-based data synthesis strategy, and a new Diverse Wearable Counting (DWC) dataset with synchronized audio and multi-modal sensor data. Empirical results on DWC show that the proposed method achieves substantially lower counting errors than frequency-based, RepNet, and TransRAC baselines, and that it generalizes to classes and subjects unseen during training.

Abstract

This paper addresses the task of counting human actions of interest using sensor data from wearable devices. We propose a novel exemplar-based framework, allowing users to provide exemplars of the actions they want to count by vocalizing predefined sounds "one", "two", and "three". Our method first localizes temporal positions of these utterances from the audio sequence. These positions serve as the basis for identifying exemplars representing the action class of interest. A similarity map is then computed between the exemplars and the entire sensor data sequence, which is further fed into a density estimation module to generate a sequence of estimated density values. Summing these density values provides the final count. To develop and evaluate our approach, we introduce a diverse and realistic dataset consisting of real-world data from 37 subjects and 50 action categories, encompassing both sensor and audio data. The experiments on this dataset demonstrate the viability of the proposed method in counting instances of actions from new classes and subjects that were not part of the training data. On average, the discrepancy between the predicted count and the ground truth value is 7.47, significantly lower than the errors of the frequency-based and transformer-based methods. Our project, code and dataset can be found at https://github.com/cvlab-stonybrook/ExRAC.
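
The last two steps described in the abstract (an exemplar-to-sequence similarity map, and a count obtained by summing a density sequence) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of cosine similarity, and the hand-built Gaussian density are all assumptions made for illustration. In the paper the density values come from a learned density estimation module, which is replaced here by a toy stand-in. The sketch also assumes that the reported average discrepancy is a mean absolute difference between predicted and true counts.

```python
import numpy as np


def cosine_similarity_map(embeddings, exemplars):
    """Similarity between each exemplar and each window embedding.

    embeddings: (T, D) array, one row per temporal window (hypothetical layout).
    exemplars:  (K, D) array, one row per exemplar.
    Returns a (K, T) array of cosine similarities.
    """
    e = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    x = exemplars / (np.linalg.norm(exemplars, axis=1, keepdims=True) + 1e-8)
    return x @ e.T


def toy_density(length, centers, sigma=2.0):
    """Stand-in for a learned density estimator (assumption, not the paper's model).

    Places one Gaussian bump per action instance and normalizes each bump
    to unit mass, so the whole sequence sums to the number of instances.
    """
    t = np.arange(length)[:, None]
    bumps = np.exp(-0.5 * ((t - np.asarray(centers)[None, :]) / sigma) ** 2)
    bumps /= bumps.sum(axis=0, keepdims=True)
    return bumps.sum(axis=1)


def count_from_density(density):
    """The final count is the sum of the per-timestep density values."""
    return float(np.sum(density))


def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute gap between predicted and ground-truth counts."""
    diff = np.asarray(predicted_counts, dtype=float) - np.asarray(true_counts, dtype=float)
    return float(np.mean(np.abs(diff)))


if __name__ == "__main__":
    density = toy_density(length=100, centers=[10, 40, 70])
    print(round(count_from_density(density), 6))
```

Normalizing each bump to unit mass is what makes the sum of the density interpretable as a count; a learned estimator would be trained so that its output has the same property.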

Paper Structure (13 sections, 3 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Processing pipeline of our method. The input consists of the sensor signal and the audio signal containing the utterances "one," "two," and "three," corresponding to three repetitions of the action of interest. The output is the total count, obtained by summing the values of the intermediate 1D density profile. This profile is better visualized as a 2D map as shown here. This figure also shows the other processing steps, which will be explained in the forthcoming method section.
  • Figure 2: Main steps of our method. Our method begins with exemplar extraction, which is based on predefined utterance detection in the audio data. Following this, per-window embeddings are extracted. Subsequently, we compute the similarity between the entire sensor sequence and the exemplars, which is then used for feature fusion. Finally, the temporal density map is estimated based on the fused features and the sensor embeddings.
  • Figure 3: DWC dataset's statistics: The left figure displays the action categories and the proportion of samples for each category in DWC. The two rightmost figures show the number of samples in various ranges of repetition count and duration.
  • Figure 4: Left: the model's performance as the amount of pretraining data is increased; "2x" represents twice the size of the real training set. Right: quantitative results on temporal location detection, reported as Off-By-K error under varying K.
  • Figure 5: Qualitative results. Four prediction examples. Each example shows the input sensor data, the estimated density map, the predicted count, and the ground truth value.