SituationalLLM: Proactive language models with scene awareness for dynamic, contextual task guidance
Muhammad Saif Ullah Khan, Muhammad Zeshan Afzal, Didier Stricker
TL;DR
SituationalLLM addresses a gap in generic LLMs, which struggle to provide actionable guidance in real-world environments because they lack physical context. It encodes the environment in a Scene Graph Language and is trained on the SAD-Instruct dataset, which combines structured scene graphs with multi-agent dialogue to produce grounded, step-by-step guidance. LoRA-based fine-tuning of LLaMA-3-8B-Instruct enables efficient integration of scene-graph information and interactive clarification during task guidance. Qualitative results show improved task specificity, reliability, and adaptability compared with baselines such as GPT-4, demonstrating the potential for robust, environment-aware AI assistants in naturalistic settings.
Abstract
Large language models (LLMs) have achieved remarkable success in text-based tasks but often struggle to provide actionable guidance in real-world physical environments. This stems from their inability to recognize the limits of their own understanding of the user's physical context. We present SituationalLLM, a novel approach that integrates structured scene information into an LLM to deliver proactive, context-aware assistance. By encoding objects, attributes, and relationships in a custom Scene Graph Language, SituationalLLM actively identifies gaps in environmental context and seeks clarifications during user interactions. This behavior emerges from training on the Situational Awareness Database for Instruct-Tuning (SAD-Instruct), which combines diverse, scenario-specific scene graphs with iterative, dialogue-based refinements. Experimental results indicate that SituationalLLM outperforms generic LLM baselines in task specificity, reliability, and adaptability, paving the way for environment-aware AI assistants capable of delivering robust, user-centric guidance under real-world constraints.
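
To make the idea of encoding objects, attributes, and relationships as text more concrete, the following is a minimal illustrative sketch. The abstract does not specify the actual Scene Graph Language syntax, so the line format used here, the class names (SceneObject, Relation, SceneGraph), and the helper clarification_questions are all hypothetical. The sketch only shows how a scene could be serialized for an LLM prompt and how a simple check for missing objects might trigger a clarifying question; in SituationalLLM itself this behavior is learned through training rather than rule-based.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class Relation:
    subject: str
    predicate: str
    target: str


@dataclass
class SceneGraph:
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

    def to_text(self) -> str:
        """Serialize the graph to a line-based text form (hypothetical syntax)."""
        lines = []
        for obj in self.objects:
            attrs = ", ".join(obj.attributes)
            lines.append(f"object {obj.name} [{attrs}]")
        for rel in self.relations:
            lines.append(f"relation {rel.subject} --{rel.predicate}--> {rel.target}")
        return "\n".join(lines)

    def missing_objects(self, required: list[str]) -> list[str]:
        """Return the required object names that are absent from the scene."""
        present = {obj.name for obj in self.objects}
        return [name for name in required if name not in present]


def clarification_questions(graph: SceneGraph, required: list[str]) -> list[str]:
    """Build one clarifying question per required object missing from the scene."""
    return [
        f"I don't see a {name} in the scene. Do you have one available?"
        for name in graph.missing_objects(required)
    ]


# Example scene: an electric kettle sitting on a counter.
kitchen = SceneGraph(
    objects=[
        SceneObject("kettle", ["electric", "empty"]),
        SceneObject("counter", ["clean"]),
    ],
    relations=[Relation("kettle", "on", "counter")],
)

# Text that could be placed in the model's prompt as scene context.
prompt_context = kitchen.to_text()

# A tea-making task needs a kettle and a mug; the mug is not in the scene.
questions = clarification_questions(kitchen, ["kettle", "mug"])
```

Here prompt_context holds the serialized scene, and questions contains a single clarification about the missing mug. A trained model would be expected to produce this kind of question from the scene context itself, without an explicit list of required objects.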
