Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

Utsav Panchal; Yuchen Liu; Luigi Palmieri; Ilche Georgievski; Marco Aiello

TL;DR

This work tackles multi-human behavior prediction from a third-person viewpoint by introducing CAMP-VLM, a Vision Language Model (VLM) framework that fuses visual context with 2D scene graphs to ground predictions in environmental structure. The method uses a two-stage fine-tuning pipeline, Supervised Fine-Tuning (SFT) with LoRA followed by Direct Preference Optimization (DPO), to adapt pre-trained VLMs for open-vocabulary prediction of the future actions of multiple humans. Evaluated on synthetic data generated in VirtualHome and on real-world sequences, CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy and demonstrates the value of explicit spatial grounding for noun-level accuracy. The results have practical implications for mobile robots and human-robot collaboration, highlighting the importance of context-aware, scene-grounded reasoning in dynamic human environments.

Abstract

Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding the behaviors of multiple humans from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance the prediction of human-scene interactions. Because suitable datasets for multi-human behavior prediction from an observer view are lacking, we fine-tune CAMP-VLM on synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.
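
The abstract names Direct Preference Optimization as the second fine-tuning stage but does not spell out the objective. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The function name, the argument names, and the value beta=0.1 are assumptions made for this example; they are not taken from the paper, and the authors' actual training setup may differ.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are summed log-probabilities of a full response.
    beta=0.1 is an illustrative assumption, not a value from the paper.
    """
    # How much more the policy favors each response than the frozen reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference agree on both responses, the loss equals log 2; it falls as the policy raises the preferred prediction relative to the rejected one, which is the behavior the preference stage is meant to encourage.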
