Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles

Tonko E. W. Bossen, Andreas Møgelmose, Ross Greer

TL;DR

Current vision-language models (VLMs) struggle to understand dynamic pedestrian gestures in autonomous driving. The study introduces two gesture datasets, ATG and ITGI, and evaluates edge-capable VLMs (VideoLLaMA2/3, Qwen2VL) via zero-shot prompting under three evaluation schemes: Embedded Similarity, Classification, and Reconstruction. Caption similarity averages below 0.59, and classification F1 scores of 0.14–0.39 fall well below the expert baseline of 0.70; pose reconstruction offers limited promise without more data. These findings underscore the need for more data, better prompting strategies, and pose-informed or hybrid approaches to enable reliable, safe cooperative autonomous driving.

Abstract

In autonomous driving, it is crucial to correctly interpret traffic gestures (TGs), such as those of an authority figure providing orders or instructions, or a pedestrian signaling the driver, to ensure a safe and pleasant traffic environment for all road users. This study investigates the capabilities of state-of-the-art vision-language models (VLMs) in zero-shot interpretation, focusing on their ability to caption and classify human gestures in traffic contexts. We create and publicly share two custom datasets with varying formal and informal TGs, such as 'Stop', 'Reverse', and 'Hail'. The datasets are "Acted TG (ATG)" and "Instructive TG In-The-Wild (ITGI)". They are annotated in natural language, describing the pedestrian's body position and gesture. We evaluate models using three methods that use expert-generated captions as baseline and control: (1) caption similarity, (2) gesture classification, and (3) pose sequence reconstruction similarity. Results show that current VLMs struggle with gesture understanding: sentence similarity averages below 0.59, and classification F1 scores reach only 0.14–0.39, well below the expert baseline of 0.70. While pose reconstruction shows potential, it requires more data and refined metrics to be reliable. Our findings reveal that although some SOTA VLMs can interpret human traffic gestures zero-shot, none are accurate and robust enough to be trustworthy, emphasizing the need for further research in this domain.
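
As a rough illustration of how a classification F1 score like those quoted above is computed, the following is a minimal sketch. The averaging scheme is not stated in the abstract, so macro averaging is an assumption here, as are the function names and the toy predictions; only the gesture labels 'Stop', 'Reverse', and 'Hail' come from the text.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy data, not results from the paper.
y_true = ["Stop", "Stop", "Reverse", "Hail"]
y_pred = ["Stop", "Hail", "Reverse", "Hail"]
print(per_class_f1(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 3))
```

Macro averaging weights every gesture class equally, which matters when some gestures are rare; a different averaging choice would give different numbers on imbalanced data.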

Paper Structure

This paper contains 17 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: In autonomous driving scenarios, navigation instructions may come from pedestrians' dynamic, nonverbal gestures. Interpreting and responding to such gestures is vital for safe autonomous driving.
  • Figure 2: Pedestrians in frame 71 of video 0153 of the COOOL dataset [alshami2024coool]. Zero-shot analysis of these pedestrians using VLMs fails to capture the significance of their gestures (or non-gestures) toward the traffic scene.
  • Figure 3: Frame 18 from the "Reverse" command gesture video. This frame's set was annotated with the caption, "The pedestrian is standing in front of the ego driver. They are facing their torso and head towards the ego driver. They are moving their flat palms back and forth towards the ego driver, gesturing for it to reverse. They are moving slightly to the side."
  • Figure 4: We validate the selected encoder and metric by testing their ability to reproduce the expected similarity trend. The 'Ideal' ratio trend, shown on the left, is used to identify the most suitable metric. The test compares a target sentence to a series of rephrased versions with progressively decreasing similarity levels. Two target captions were formulated about the same hypothetical scenario (e.g., "A person signals the ego driver to stop, by putting their hand towards the ego driver."). Each similarity level contains two rephrases cross-validated against both target captions to reduce sentence noise. The rephrases span five levels of similarity: 'Extended' with additional information, where it can be difficult to know whether the addition is irrelevant (noise) or incorrect (false positive) (e.g., "A pedestrian raises their hand towards the ego driver to stop traffic. They are looking scared and in need of help."), 'Equivalent' with the same information (e.g., "A pedestrian raises their hand towards the ego driver to stop traffic."), 'Partial' with partially equivalent information (e.g., "A person raises their hand towards the ego driver."), 'Slight' which is missing important details (e.g., "A human gestures to the ego driver."), and 'Unrelated' information (e.g., "The sky is blue and the sun is shining.").
  • Figure 5: Cosine similarity of the generated captions relative to only the ground truth captions (left panel) and to both the ground truth and expert captions (right panel). The panels are shown together to illustrate the decrease in standard deviation due to the denoising. In the left panel, VideoLLaMA2 and VideoLLaMA3 show a slightly higher Q1 than the expert-generated captions. However, the experts' mean is 0.54, while the means of VideoLLaMA2 and VideoLLaMA3 are 0.50 and 0.49, and the mean of Qwen is 0.44. In the right panel, Qwen's mean increases to 0.46, while those of VideoLLaMA2 and VideoLLaMA3 decrease to 0.48 and 0.47. This again shows the diversity in the non-VLM captions: there are multiple ways to describe a given gesture.
  • ...and 4 more figures
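
The encoder-and-metric check described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: `toy_encode` is a hypothetical word-count stand-in for a real sentence encoder, the helper names are invented for this example, and only one rephrase per level is used (the caption describes two per level and two target captions). The sentences themselves are taken from the caption.

```python
import math
import re
from collections import Counter


def toy_encode(sentence):
    """Stand-in encoder: lowercase word counts (NOT the paper's encoder)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_levels(target, rephrases, encode=toy_encode):
    """Score each rephrase level against the target caption."""
    target_vec = encode(target)
    return {level: cosine_similarity(target_vec, encode(text))
            for level, text in rephrases.items()}


def is_non_increasing(values):
    """True if similarity never rises from one level to the next."""
    return all(a >= b for a, b in zip(values, values[1:]))


target = ("A person signals the ego driver to stop, "
          "by putting their hand towards the ego driver.")
rephrases = {
    "Extended": "A pedestrian raises their hand towards the ego driver to "
                "stop traffic. They are looking scared and in need of help.",
    "Equivalent": "A pedestrian raises their hand towards the ego driver "
                  "to stop traffic.",
    "Partial": "A person raises their hand towards the ego driver.",
    "Slight": "A human gestures to the ego driver.",
    "Unrelated": "The sky is blue and the sun is shining.",
}

scores = score_levels(target, rephrases)
for level, score in scores.items():
    print(f"{level:<10} {score:.2f}")

# Check the trend over the levels of interest for a candidate encoder.
levels = ["Equivalent", "Partial", "Slight", "Unrelated"]
print(is_non_increasing([scores[name] for name in levels]))
```

A word-count encoder rewards shared vocabulary rather than shared meaning, so it should not be expected to pass this check; passing any callable that maps a sentence to a compatible vector as `encode` lets the same test be run against a real sentence encoder.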