Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments

Dragos Costea, Alina Marcu, Cristina Lazar, Marius Leordeanu

TL;DR

This work introduces the first framework that generates a natural non-verbal interaction between Human and AI in real time from 2D body keypoints, and demonstrates that statistically distinguishable differences persist between Human and AI motion.

Abstract

We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full-body motion. Concretely, we ask whether contemporary generative models move beyond surface mimicry to participate in the silent but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real time from 2D body keypoints. Our experiments use four lightweight architectures that run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We train on 437 human video clips and demonstrate that pretraining on synthetically generated sequences reduces motion errors significantly without sacrificing speed. Yet a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems such as SORA and VEO, performance drops on SORA-generated clips but degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.

Paper Structure (9 sections, 1 equation, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Results confirming the effectiveness of pretraining on AI-generated data produced with MotionLCM (Dai et al., 2024). While the choice of architecture is important, most models show significant improvements over the baseline, and these improvements generally increase with the number of synthetic samples seen (9k, 45k, or 90k, from left to right).
  • Figure 2: The prompts used to condition SORA (first 3 columns) and VEO (last column) to generate videos depicting each of our proposed emotions.
  • Figure 3: "In-the-wild" evaluation (using MAE) for motion prediction on both synthetic and real data, for baseline models (first row) and pretrained models (Pretrain-90k, second row), sampled every 10 epochs. The ranking of data sources is mostly preserved regardless of the architecture or pretraining data. Remarkably, VEO has the lowest error. This suggests that recent large models are more predictable and that specific models might be identified by the way they predict keypoints.
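
Figure 3 reports mean absolute error (MAE) for motion prediction on 2D keypoints. As a minimal sketch of how such a metric could be computed, the snippet below assumes that predictions and ground truth are arrays of shape (frames, joints, 2) in the same coordinate units. The function name `keypoint_mae`, the array layout, and the toy values are illustrative assumptions; the paper's exact normalization and averaging scheme are not given in this summary.

```python
import numpy as np


def keypoint_mae(pred, target):
    """Mean absolute error averaged over frames, joints, and (x, y) coordinates.

    Hypothetical helper: the (frames, joints, 2) layout is an assumption,
    not the paper's documented format.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {target.shape}")
    return float(np.mean(np.abs(pred - target)))


# Toy example with made-up values: 2 frames, 3 joints, 2D coordinates.
target = np.zeros((2, 3, 2))
pred = target + 0.5  # every coordinate is off by exactly 0.5
print(keypoint_mae(pred, target))  # 0.5
```

Averaging over every coordinate gives a single scalar per clip or dataset, which makes it possible to compare error levels across data sources such as real, SORA, and VEO clips, provided the keypoints are normalized consistently.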