HMR-1: Hierarchical Massage Robot with Vision-Language-Model for Embodied Healthcare

Rongtao Xu, Mingming Yu, Xiaofeng Han, Yu Zhang, Kaiyi Hu, Zhe Feng, Zenghuang Fu, Changwei Wang, Weiliang Meng, Xiaopeng Zhang

TL;DR

This work constructs MedMassage-12K - a multimodal dataset containing 12,190 images with 174,177 QA pairs, covering diverse lighting conditions and backgrounds, and proposes a hierarchical embodied massage framework, which includes a high-level acupoint grounding module and a low-level control module.

Abstract

The rapid advancement of Embodied Intelligence has opened transformative opportunities in healthcare, particularly in physical therapy and rehabilitation. However, critical challenges remain in developing robust embodied healthcare solutions, such as the lack of standardized evaluation benchmarks and the scarcity of open-source multimodal acupoint massage datasets. To address these gaps, we construct MedMassage-12K - a multimodal dataset containing 12,190 images with 174,177 QA pairs, covering diverse lighting conditions and backgrounds. Furthermore, we propose a hierarchical embodied massage framework, which includes a high-level acupoint grounding module and a low-level control module. The high-level acupoint grounding module uses multimodal large language models to understand human language and identify acupoint locations, while the low-level control module provides the planned trajectory. Based on this, we evaluate existing MLLMs and establish a benchmark for embodied massage tasks. Additionally, we fine-tune the Qwen-VL model, demonstrating the framework's effectiveness. Physical experiments further confirm the practical applicability of the framework. Our dataset and code are publicly available at https://github.com/Xiaofeng-Han-Res/HMR-1.
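The two-level split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `ground_acupoint`, `backproject`, `plan_approach`, and the `Intrinsics` values are hypothetical names and numbers introduced only for illustration. The sketch assumes the high-level module returns a pixel coordinate, that depth is given in meters, and that a standard pinhole camera model lifts the pixel to a 3D point. The abstract does not specify any of these details.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics in pixels (illustrative values, not from the paper)."""
    fx: float
    fy: float
    cx: float
    cy: float


def ground_acupoint(instruction: str, rgb: np.ndarray) -> Tuple[int, int]:
    """Hypothetical stand-in for the high-level grounding module.

    A real system would query a multimodal LLM here; this placeholder
    simply returns the image center as the predicted pixel (u, v).
    """
    height, width = rgb.shape[:2]
    return width // 2, height // 2


def backproject(u: int, v: int, depth_m: np.ndarray, k: Intrinsics) -> np.ndarray:
    """Lift pixel (u, v) to a 3D point in the camera frame (pinhole model)."""
    z = float(depth_m[v, u])
    if z <= 0.0:
        raise ValueError("invalid depth at the grounded pixel")
    x = (u - k.cx) * z / k.fx
    y = (v - k.cy) * z / k.fy
    return np.array([x, y, z])


def plan_approach(target: np.ndarray, standoff_m: float = 0.05, steps: int = 5) -> np.ndarray:
    """Hypothetical stand-in for the low-level control module.

    Returns straight-line waypoints along the camera z-axis, from a
    standoff position in front of the target to the target itself.
    """
    start = target - np.array([0.0, 0.0, standoff_m])
    return np.linspace(start, target, steps)


# Synthetic observation: blank RGB image and a flat surface 0.8 m away.
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.full((480, 640), 0.8, dtype=np.float32)
cam = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)

u, v = ground_acupoint("massage the target acupoint", rgb)
target = backproject(u, v, depth, cam)
waypoints = plan_approach(target)
```

Keeping grounding and control behind separate functions mirrors the hierarchy in the abstract: the grounding stand-in could be replaced by a fine-tuned vision-language model without changing the trajectory code. A full system would also need end-effector orientation and a camera-to-robot transform, both of which this sketch omits.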

Paper Structure (18 sections, 6 figures, 4 tables)

This paper contains 18 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The proposed HMR framework. Given a textual instruction and an RGB-D observation, the model predicts the 6-DOF end-effector pose for robotic massage.
  • Figure 2: The bar chart on the left shows the distribution of acupoints in the test set and training set, while the right side displays images of the mannequin with acupoints under different lighting and background conditions.
  • Figure 3: System architecture of the proposed framework, consisting of a High-Level Grounding Module (HLGM) and a Low-Level Control Module (LLCM).
  • Figure 4: Massage in real-world environments: first-person RGB and third-person frames of robot execution.