Table of Contents
Fetching ...

GPT Sonograpy: Hand Gesture Decoding from Forearm Ultrasound Images via VLM

Keshav Bimbraw, Ye Wang, Jing Liu, Toshiaki Koike-Akino

TL;DR

This work investigates whether a large vision-language model, GPT-4o, can decode hand gestures from forearm ultrasound images without fine-tuning. By encoding ultrasound frames as text and applying few-shot in-context learning, the authors show that GPT-4o achieves notable gesture classification accuracy, with within-session results reaching about 74% after 2 training examples and cross-session results around 61% with 3 examples. The study demonstrates the practical potential of LVLMs for medical imaging tasks where fine-tuning is costly, and it highlights the influence of prompts, reasoning capabilities, and input formats on performance. Overall, the results suggest that LVLMs can serve as effective, label-efficient tools for ultrasound-based gesture interpretation and human–machine interfaces, motivating further cross-subject validation and comparisons with retrieval-based and PEFT approaches.

Abstract

Large vision-language models (LVLMs), such as the Generative Pre-trained Transformer 4-omni (GPT-4o), are emerging multi-modal foundation models which have great potential as powerful artificial-intelligence (AI) assistance tools for a myriad of applications, including healthcare, industrial, and academic sectors. Although such foundation models perform well in a wide range of general tasks, their capability without fine-tuning is often limited in specialized tasks. However, full fine-tuning of large foundation models is challenging due to enormous computation/memory/dataset requirements. We show that GPT-4o can decode hand gestures from forearm ultrasound data even with no fine-tuning, and improves with few-shot, in-context learning.

GPT Sonograpy: Hand Gesture Decoding from Forearm Ultrasound Images via VLM

TL;DR

This work investigates whether a large vision-language model, GPT-4o, can decode hand gestures from forearm ultrasound images without fine-tuning. By encoding ultrasound frames as text and applying few-shot in-context learning, the authors show that GPT-4o achieves notable gesture classification accuracy, with within-session results reaching about 74% after 2 training examples and cross-session results around 61% with 3 examples. The study demonstrates the practical potential of LVLMs for medical imaging tasks where fine-tuning is costly, and it highlights the influence of prompts, reasoning capabilities, and input formats on performance. Overall, the results suggest that LVLMs can serve as effective, label-efficient tools for ultrasound-based gesture interpretation and human–machine interfaces, motivating further cross-subject validation and comparisons with retrieval-based and PEFT approaches.

Abstract

Large vision-language models (LVLMs), such as the Generative Pre-trained Transformer 4-omni (GPT-4o), are emerging multi-modal foundation models which have great potential as powerful artificial-intelligence (AI) assistance tools for a myriad of applications, including healthcare, industrial, and academic sectors. Although such foundation models perform well in a wide range of general tasks, their capability without fine-tuning is often limited in specialized tasks. However, full fine-tuning of large foundation models is challenging due to enormous computation/memory/dataset requirements. We show that GPT-4o can decode hand gestures from forearm ultrasound data even with no fine-tuning, and improves with few-shot, in-context learning.
Paper Structure (35 sections, 9 figures, 3 tables)

This paper contains 35 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Conversation with GPT-4o that motivated us to use the VLM for ultrasound image decoding.
  • Figure 2: Hand gestures (a through e) and the corresponding forearm ultrasound image (f through j) from subject 1. (a) and (f): Index flexion; (b) and (g): all pinch; (c) and (h) hand horns; (d) and (i) fist; (e) and (j): open hand.
  • Figure 3: Conversation with GPT-4o for forearm ultrasound classification based on 1-shot learning.
  • Figure 4: Confusion matrices for within-session (a--d), cross-session (e--h), and randomized cross-session (i--l) experiments summed over the three subjects for: 0-shot (a, e, and i), 1-shot (b, f, and j), 2-shot (c, g, and k), and 3-shot (d, h, and l) strategies.
  • Figure 5: Confusion matrices with different prompts (within-session, subject 1, 1-shot).
  • ...and 4 more figures