HiSync: Spatio-Temporally Aligning Hand Motion from Wearable IMU and On-Robot Camera for Command Source Identification in Long-Range HRI

Chengwen Zhang, Chun Yu, Borong Zhuang, Haopeng Jin, Qingyang Wan, Zhuojun Li, Zhe He, Zhoutong Ye, Yu Mei, Chang Liu, Weinan Shi, Yuanchun Shi

Abstract

Long-range Human-Robot Interaction (HRI) remains underexplored. Within it, Command Source Identification (CSI), i.e., determining who issued a command, is especially challenging because of multi-user and distance-induced sensor ambiguity. We introduce HiSync, an optical-inertial fusion framework that treats hand motion as a binding cue by aligning optical flow from a robot-mounted camera with hand-worn IMU signals. We first elicit a user-defined gesture set (N=12) and collect a multimodal command gesture dataset (N=38) in long-range multi-user HRI scenarios. Next, HiSync extracts frequency-domain hand motion features from both camera and IMU data; a learned CSINet then denoises IMU readings, temporally aligns the modalities, and performs distance-aware multi-window fusion to compute cross-modal similarity of subtle, natural gestures, enabling robust CSI. In three-person scenes at up to 34 m, HiSync achieves 92.32% CSI accuracy, outperforming the prior SOTA by 48.44%. HiSync is also validated in a real-robot deployment. By making CSI reliable and natural, HiSync provides a practical primitive and design guidance for public-space HRI.
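
To make the core idea of cross-modal matching concrete, the following is a minimal sketch and not the authors' CSINet. It assumes the IMU stream and each candidate person's optical-flow magnitude have already been resampled to a common rate and window length. It replaces the learned denoising, alignment, and distance-aware fusion with a plain cosine similarity between magnitude spectra. The function names, the 30 Hz rate, and the synthetic signals are all hypothetical.

```python
import numpy as np

def magnitude_spectrum(signal, n_bins=32):
    """Unit-norm magnitude spectrum of a 1-D motion signal (illustrative feature)."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                                   # drop the constant offset
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))[:n_bins]
    norm = np.linalg.norm(spec)
    return spec / norm if norm > 0 else spec

def identify_source(imu_signal, candidate_flows):
    """Return the index of the candidate whose spectrum best matches the IMU."""
    imu_spec = magnitude_spectrum(imu_signal)
    scores = [float(imu_spec @ magnitude_spectrum(f)) for f in candidate_flows]
    return int(np.argmax(scores)), scores

# Synthetic demo (assumed values): 4 s window at 30 Hz, three people in view.
rng = np.random.default_rng(0)
t = np.arange(0, 4, 1 / 30)
imu = np.sin(2 * np.pi * 1.5 * t) + 0.1 * rng.standard_normal(t.size)
flows = [
    np.sin(2 * np.pi * 0.7 * t),         # bystander moving slowly
    np.sin(2 * np.pi * 1.5 * t + 0.8),   # wearer: same rhythm, phase-shifted
    np.sin(2 * np.pi * 2.8 * t),         # bystander moving quickly
]
best, scores = identify_source(imu, flows)
print(best, [round(s, 3) for s in scores])
```

Because the magnitude spectrum discards phase, this toy comparison tolerates a small time offset between the two sensors. It would not separate people who gesture at the same rhythm, which is the kind of ambiguity the learned components described in the abstract are meant to address.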

Paper Structure (77 sections, 7 equations, 27 figures, 3 tables)

Figures (27)

  • Figure 1: Demonstration of HiSync. Multiple people perform similar gestures at a distance of 34 m. The figure illustrates the application scenario: Person 1 (blue link) controls the quadruped, while Person 2 (orange link) simultaneously controls the drone. Other bystanders act as visual distractors. Each robot receives an inertial stream from its paired device. HiSync enables the robot to identify its bound command issuer, effectively rejecting distractors regardless of their kinematic similarity.
  • Figure 2: Visual Ambiguity at a Distance of 34 m. Panel (A) shows a real sample from our dataset (1920 $\times$ 1080 resolution). The inset highlights that the hand region occupies fewer than 10 $\times$ 10 pixels. Even on a manually zoomed-in view, as in (B), YOLOv11x [Jocher_Ultralytics_YOLO_2023] fails to detect the hand. This demonstrates the inherent visual ambiguity of long-range interaction.
  • Figure 3: Apparatus of Formative Study. (A) Three robot forms used in the study. (B) Example of a participant performing a gesture toward a quadruped robot.
  • Figure 4: Command Vocabulary Proposed by Participants.
  • Figure 5: Illustration of the User-defined Gesture Set.
  • ...and 22 more figures