Inferring Intentions to Speak Using Accelerometer Data In-the-Wild

Litian Li, Jord Molhoek, Jing Zhou

TL;DR

This work investigates inferring intentions to speak from in-the-wild accelerometer data, using the REWIND dataset and framing the task as binary classification of pre-speech windows. A residual CNN is trained on privacy-preserving accelerometer signals from successful speaking events and evaluated against annotated unsuccessful intentions via AUC over multiple time windows, revealing only modest predictive power above random chance. The study identifies cues such as posture shifts and audible mouth-opening patterns while acknowledging limitations from a small annotated sample and potential confounds, and argues that accelerometry alone is insufficient for reliable intention inference. It outlines future directions toward multimodal sensing and broader, culturally diverse data to achieve robust, real-world inference of speaking intentions.

Abstract

Humans have a good natural intuition for recognizing when another person has something to say. It would be interesting if an AI could also recognize intentions to speak. This skill can be especially useful in scenarios where an AI is guiding a group discussion. This work studies the inference of successful and unsuccessful intentions to speak from accelerometer data. Accelerometer data is chosen because it is privacy-preserving and feasible for in-the-wild settings, since the sensor can be placed in a smart badge. Data from a real-life social networking event is used to train a machine-learning model that aims to infer intentions to speak. A subset of unsuccessful intention-to-speak cases in the data is annotated. The model is trained on the successful intentions to speak and evaluated on both the successful and unsuccessful cases. In conclusion, there is useful information in accelerometer data, but not enough to reliably capture intentions to speak. For example, posture shifts are correlated with intentions to speak, but people also often shift posture without having an intention to speak, or have an intention to speak without shifting their posture. More modalities are likely needed to reliably infer intentions to speak.
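
The evaluation idea summarized above (score a window of accelerometer samples preceding a speech onset, then measure how well the scores separate positive from negative windows using AUC) can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names `extract_window` and `roc_auc`, the toy signal, and the toy scores are all hypothetical, and the residual CNN that would produce the scores is omitted.

```python
from typing import List, Sequence


def extract_window(signal: Sequence[float], onset: int, length: int) -> List[float]:
    """Return the `length` samples that end right before index `onset`.

    `onset` stands in for the sample index where speech starts (an assumption
    for illustration; real data would also need a sampling rate).
    """
    if length <= 0 or onset - length < 0:
        raise ValueError("window does not fit before the onset")
    return list(signal[onset - length:onset])


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. An AUC of 0.5 corresponds to chance.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy single-axis signal; the window covers the two samples before index 4.
window = extract_window([0.0, 0.1, 0.3, 0.2, 0.9, 0.8], onset=4, length=2)

# Toy model scores: label 1 = window precedes an intention to speak.
auc = roc_auc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1])
```

In this toy case three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75. Repeating the computation for several window lengths would give one AUC per time window.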

Paper Structure (29 sections, 10 figures, 3 tables)

This paper contains 29 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Extraction of successful intention-to-speak cases.
  • Figure 2: Visualisation of the first three rows of Table \ref{tab:auc_results}. Note that points are shifted slightly left or right to prevent overlap of the standard deviations.
  • Figure 3: Visualisation of the final two rows of Table \ref{tab:auc_results}.
  • Figure 4: Datasets Comparison
  • Figure 5: Samples for different experiments. Experiment numbers are consistent with the order in Tables \ref{tab:auc_results} and \ref{tab:ttest}.
  • ...and 5 more figures