SituationalLLM: Proactive language models with scene awareness for dynamic, contextual task guidance

Muhammad Saif Ullah Khan, Muhammad Zeshan Afzal, Didier Stricker

TL;DR

SituationalLLM targets a gap in generic LLMs: without the user's physical context, they struggle to provide actionable guidance in real-world environments. It encodes the environment in a Scene Graph Language and trains on the SAD-Instruct dataset, which combines structured scene graphs with multi-agent dialogue to produce grounded, step-by-step guidance. LoRA-based fine-tuning of LLaMA-3-8B-Instruct integrates scene-graph information efficiently and supports interactive clarification during task guidance. Qualitative results show improved task specificity, reliability, and adaptability compared with baselines such as GPT-4, demonstrating the potential for robust, environment-aware AI assistants in naturalistic settings.

Abstract

Large language models (LLMs) have achieved remarkable success in text-based tasks but often struggle to provide actionable guidance in real-world physical environments. This is because of their inability to recognize their limited understanding of the user's physical context. We present SituationalLLM, a novel approach that integrates structured scene information into an LLM to deliver proactive, context-aware assistance. By encoding objects, attributes, and relationships in a custom Scene Graph Language, SituationalLLM actively identifies gaps in environmental context and seeks clarifications during user interactions. This behavior emerges from training on the Situational Awareness Database for Instruct-Tuning (SAD-Instruct), which combines diverse, scenario-specific scene graphs with iterative, dialogue-based refinements. Experimental results indicate that SituationalLLM outperforms generic LLM baselines in task specificity, reliability, and adaptability, paving the way for environment-aware AI assistants capable of delivering robust, user-centric guidance under real-world constraints.
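
To make the idea of encoding objects, attributes, and relationships as text more concrete, here is a minimal Python sketch. The abstract does not specify the Scene Graph Language syntax, so the `SceneGraph` class, the `serialize` function, and the `object ...` / `relation ...` line format are all hypothetical stand-ins chosen for illustration, not the authors' format.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Hypothetical container: object name -> list of attribute strings.
    objects: dict[str, list[str]] = field(default_factory=dict)
    # Relationships as (subject, predicate, object) triples.
    relations: list[tuple[str, str, str]] = field(default_factory=list)


def serialize(graph: SceneGraph) -> str:
    """Render the graph as plain text lines (assumed format, not the paper's)."""
    lines = []
    # Objects are listed alphabetically so the output is deterministic.
    for name in sorted(graph.objects):
        attrs = ", ".join(graph.objects[name])
        lines.append(f"object {name} [{attrs}]")
    # Relations keep their insertion order.
    for subj, pred, obj in graph.relations:
        lines.append(f"relation {subj} -{pred}-> {obj}")
    return "\n".join(lines)


# Toy scene, invented for this example.
kitchen = SceneGraph(
    objects={"jar": ["glass", "sealed"], "counter": ["granite"]},
    relations=[("jar", "on", "counter")],
)
scene_text = serialize(kitchen)
print(scene_text)
```

A text rendering like this could be placed in the prompt alongside the user's request, giving the model explicit objects and relations to ground its guidance in, or to notice as missing.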

Paper Structure (28 sections, 13 figures, 1 table)

Figures (13)

  • Figure 1: GPT-4 provides comprehensive but generic guidance when assisting with physical tasks, failing to account for specific user situations and constraints. It presumes that the jar is "stubborn" and neglects to ask for details like the type of jar or the user's limitations, which can lead to less applicable advice. An ideal LLM-driven AI assistant should provide tailored advice, considering the user's real-world situation, implying a need for awareness of their physical context.
  • Figure 2: Methodology and scenario diversity
  • Figure 3: Pruned scene graphs. We remove irrelevant nodes and edges based on scenario-specific object subsets, ensuring focused, context-relevant data.
  • Figure 4: Effectiveness of using scenario-specific scene graphs. Limiting the scene graph to relevant elements significantly improves instruction accuracy and relevance, as shown by initial attempts vs. refined outputs.
  • Figure 5: Dialogue Pipeline. Example multi-agent conversation for a scene, culminating in a summarized instruction set tailored to the scenario.
  • ...and 8 more figures
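
The pruning described in the Figure 3 caption (dropping nodes and edges outside a scenario-specific object subset) can be sketched in Python as below. This is an illustrative sketch under assumptions: the function name `prune_scene_graph`, the dictionary and triple representation, and the toy scene are hypothetical and are not taken from the paper.

```python
def prune_scene_graph(objects, relations, relevant):
    """Keep objects named in `relevant` and relations whose endpoints both survive.

    objects:   dict mapping object name -> list of attributes (assumed layout)
    relations: list of (subject, predicate, object) triples (assumed layout)
    relevant:  set of object names considered relevant to the scenario
    """
    kept_objects = {
        name: attrs for name, attrs in objects.items() if name in relevant
    }
    # An edge is dropped as soon as either of its endpoints was pruned.
    kept_relations = [
        (subj, pred, obj)
        for subj, pred, obj in relations
        if subj in kept_objects and obj in kept_objects
    ]
    return kept_objects, kept_relations


# Toy scene, invented for this example.
objects = {
    "jar": ["glass"],
    "counter": ["granite"],
    "painting": ["framed"],
    "wall": ["white"],
}
relations = [
    ("jar", "on", "counter"),
    ("painting", "on", "wall"),
    ("counter", "against", "wall"),
]

pruned_objects, pruned_relations = prune_scene_graph(
    objects, relations, relevant={"jar", "counter"}
)
print(pruned_objects)
print(pruned_relations)
```

How the relevant object subset is chosen for each scenario is not stated in the captions above, so the sketch simply takes it as an input.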