Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

Utsav Panchal, Yuchen Liu, Luigi Palmieri, Ilche Georgievski, Marco Aiello

TL;DR

This work tackles multi-human behavior prediction from a third-person viewpoint by introducing CAMP-VLM, a Vision-Language Model framework that fuses visual context with 2D scene graphs to ground predictions in environmental structure. The method uses a two-stage fine-tuning pipeline (SFT with LoRA, followed by Direct Preference Optimization) to adapt pre-trained VLMs for open-vocabulary prediction of the future actions of multiple humans. Evaluated on synthetic data generated in VirtualHome and on real-world sequences, CAMP-VLM improves prediction accuracy by up to 66.9% over the best-performing baseline and demonstrates the value of explicit spatial grounding for noun-level accuracy. The results have practical implications for mobile robots and human-robot collaboration, highlighting the importance of context-aware, scene-grounded reasoning in dynamic human environments.

Abstract

Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of human-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we fine-tune CAMP-VLM on synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.
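
The second fine-tuning stage named in the abstract, DPO, can be illustrated with a minimal sketch of the standard per-example DPO objective. This is the generic published form of the loss, not necessarily the exact configuration used for CAMP-VLM: the function names, the value of `beta`, and the example log-probabilities are assumptions made for illustration only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the summary does not state it
) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    (here: a predicted behavior description) under the trainable policy
    or under the frozen reference model (e.g. the SFT checkpoint).
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the chosen response gains on the rejected one.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Hypothetical log-probabilities for one (chosen, rejected) prediction pair.
loss_at_init = dpo_loss(-2.0, -2.0, -2.0, -2.0)
loss_after_update = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy equals the reference, both margins are zero and the loss is log 2; it falls below that once the policy assigns relatively more probability to the preferred prediction than the reference does.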

Paper Structure

This paper contains 23 sections, 3 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: CAMP-VLM is a VLM-based framework for Context-Aware Multi-human behavior Prediction. Receiving an image sequence of past observations from third-person views and a Scene Graph (SG) representing the environmental topologies (excluding the textual action labels), the two-stage fine-tuning process helps CAMP-VLM to more accurately predict multi-human behaviors.
  • Figure 2: An overview of CAMP-VLM, a VLM-centered framework for Context-Aware Multi-human behavior Prediction. The video frames are processed by the vision encoder into visual tokens, which are then passed into the Large Language Model (LLM) backbone via the projection layer. The context encoded in the images helps the VLM to discern interactions between humans and the scene. The scene knowledge encoded in the Scene Graph (SG) is provided to ground the predictions in the provided scene topologies and relationships. Under the guidance of the user-provided prompt, the LLM predicts human behaviors in the given format. The LLM is fine-tuned to improve the prediction performance, while the weights of the vision encoder and projection layer remain unchanged.
  • Figure 3: Data generation and fine-tuning process.
  • Figure 4: Example scenes of the datasets. From left to right: kitchen, living room, and bedroom from the VirtualHome simulation (Puig et al., 2018), and office kitchen and living room in the real-world video recordings.
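
The captions for Figures 1 and 2 describe a scene graph being supplied to the LLM alongside the user prompt. A rough sketch of how such a graph might be flattened into prompt text is given below. The `Relation` type, the triple format, the relation names, and the prompt wording are all hypothetical; this summary does not specify the actual serialization or prompt used by CAMP-VLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Relation:
    """One scene-graph edge (hypothetical format), e.g. mug ON kitchen_table.

    Only spatial relations are stored; following the Figure 1 caption,
    no textual action labels are included in the graph.
    """
    subject: str
    predicate: str
    target: str


def serialize_scene_graph(relations: List[Relation]) -> str:
    """Flatten the graph into one 'subject predicate target' line per edge."""
    return "\n".join(f"{r.subject} {r.predicate} {r.target}" for r in relations)


def build_prompt(relations: List[Relation], num_humans: int, num_frames: int) -> str:
    """Assemble an illustrative text prompt; the image frames are passed separately."""
    return (
        f"You are given {num_frames} past frames from a third-person camera.\n"
        f"Scene graph:\n{serialize_scene_graph(relations)}\n"
        f"Predict the next action of each of the {num_humans} humans."
    )


# Hypothetical kitchen scene.
relations = [
    Relation("mug", "ON", "kitchen_table"),
    Relation("kitchen_table", "INSIDE", "kitchen"),
]
graph_text = serialize_scene_graph(relations)
prompt = build_prompt(relations, num_humans=2, num_frames=8)
```

Keeping the graph as plain text means it travels through the same token stream as the user prompt, so only the visual input needs the vision encoder and projection layer described in Figure 2.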