Is the House Ready For Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering

Vishnu Sashank Dorbala; Prasoon Goyal; Robinson Piramuthu; Michael Johnston; Reza Ghanadhan; Dinesh Manocha

TL;DR

The paper introduces Situational Embodied Question Answering (S-EQA) and a Prompt-Generate-Evaluate (PGE) framework that generates and validates queries whose answers require consensus on multiple object states. Using VirtualHome, PGE generates ~2000 situational datapoints; an MTurk study finds 97.26% of them answerable. LLMs nonetheless struggle to answer or reason about these queries, reaching only a 46.2% correlation with the human ground truth. The work also demonstrates real-world transfer challenges: when a grounded scene graph is unavailable, the LLM hallucinates object states, which highlights the need for stronger multimodal grounding and uncertainty modeling. Overall, the study formalizes situational queries, proposes a generative data creation pipeline, and outlines directions for improving the real-world usability of embodied agents.

Abstract

We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and properties ("What is the color of the car?"), situational queries (such as "Is the house ready for sleeptime?") are challenging as they require the agent to correctly identify multiple object-states (Doors: Closed, Lights: Off, etc.) and reach a consensus on their states for an answer. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to generate unique situational queries and corresponding consensus object information. PGE is used to generate 2K datapoints in the VirtualHome simulator, which are then annotated for ground truth answers via a large-scale user study conducted on M-Turk. With a high rate of answerability (97.26%) in this study, we establish that LLMs are good at generating situational data. However, in evaluating the data using an LLM, we observe a low correlation of 46.2% with the ground truth human annotations, indicating that while LLMs are good at generating situational queries, they struggle to answer them according to consensus. When asked for reasoning, we observe that the LLM often goes against commonsense in justifying its answer. Finally, we utilize PGE to generate situational data in a real-world environment, exposing LLM hallucination in generating reliable object-states when a structured scene graph is unavailable. To the best of our knowledge, this is the first work to introduce EQA in the context of situational queries and also the first to present a generative approach for query creation. We aim to foster research on improving the real-world usability of embodied agents through this work.
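
To make the notion of consensus on object states concrete, here is a minimal Python sketch. It is not the authors' implementation: the `ConsensusState` type, the `answer_query` helper, and the strict rule that every listed object must match its required state are all assumptions made for illustration. In the paper, ground-truth answers come from human annotators rather than from a fixed rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsensusState:
    """One object and the state it should be in (hypothetical type)."""
    obj: str
    state: str


def answer_query(consensus, observed):
    """Return "Yes" only if every consensus object is observed in its required state.

    The all-must-match rule is an illustrative assumption, not the paper's rule.
    """
    for item in consensus:
        if observed.get(item.obj) != item.state:
            return "No"
    return "Yes"


# Consensus states for "Is the house ready for sleeptime?", following the
# example object-states given in the abstract.
sleeptime = [ConsensusState("door", "closed"), ConsensusState("lights", "off")]

print(answer_query(sleeptime, {"door": "closed", "lights": "off"}))
print(answer_query(sleeptime, {"door": "closed", "lights": "on"}))
```

A simple query such as "What is the color of the car?" needs one lookup; the situational query above needs every listed state to be checked before an answer can be given.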

Paper Structure (16 sections, 2 equations, 8 figures, 2 tables)

Introduction
Related Work
Situational EQA: Generation & Validation
Situational Query Definition
Generating Situational Queries: S-EQA
Validating S-EQA: t-SNE plots and M-Turk Annotation
Situational EQA: Evaluation
VQA Evaluation:
LLM Evaluation:
Real World Experiments:
Results and Inferences
Simulator Setup
VQA Results
LLM Eval Results
Real World Experiments: Results and Discussion
...and 1 more sections

Figures (8)

Figure 1: Simple vs Situational EQA: We introduce Situational Embodied Question Answering (S-EQA), where queries require inspecting multiple objects and deriving consensus knowledge of their states. Using an LLM, we generate S-EQA on VirtualHome, with consensus object states and relationships. A large-scale MTurk study annotates and verifies data authenticity. Evaluating LLMs on S-EQA reveals their strength in generating queries and consensus but misalignment in answering them. LLM explanations further suggest poor commonsense reasoning on our task.
Figure 2: Prompt-Generate-Evaluate (PGE) Scheme: We utilize an LLM (GPT-4) to generate situational datapoints comprising queries and their corresponding consensus object states and relationships. The data generation occurs over $m$ iterations, refining LLM prompts after evaluating every batch of $n$ datapoints. For each batch, BERT embeddings of the $n$ queries are computed and denoted as $Emb(Q_{n})$. A cosine similarity $\mathds{C}$ is then computed between $Emb(Q_{n})$ and $Emb(\mathds{Q}_{S})$, the embeddings of existing queries in the Situational Query Database $\mathds{Q}_{S}$. $\mathrm{C}$ is a threshold used to determine if the generated queries are sufficiently different from those in the database; if not, we continue the conversation with the LLM (indicated with the red arrows). Concurrently, we cluster $Emb(\mathds{Q}_{S})$ into $k$ categories, selecting the query nearest to the centroid in each cluster for feedback, labeled as "Generated Queries" in the System Prompt (denoted by blue arrows). The ideal datapoint generation pathway is highlighted by the green arrows.
Figure 3: t-SNE plots of S-EQA: We categorize and analyze the generated data in three aspects. For Room categories, we observe that most of the generated datapoints ($83.5\%$) belong to a particular room, suggesting that the LLM tends to create room-specific queries as opposed to multi-room or no-room scenarios. For Situational Categories, we observe that $82.20\%$ of the data generated is situational according to our definition. For Spatio-Temporal Categories, we look at whether the questions can be answered using only the spatial positioning of objects and infer that $77.96\%$ of the queries are spatial.
Figure 4: Influence of PGE: Notice that the queries without PGE directly reference objects (microwave and computer) and are not situational. They are also merely rephrasings of one another. PGE incorporates feedback measures to generate unique and diverse queries. Some queries may not hold general consensus on object states (such as "Was there a break-in at the house?"), and these are filtered via human validation.
Figure 5: S-EQA Simulator Eval: We evaluate S-EQA generated from VirtualHome using the LLM with Scene Graphs (Orange) and VQA (Green). In this example, the situational query asks if someone was working in the bedroom, to which the annotation was positive (Yes). For the LLM Eval, we pass the entire VirtualHome scene graph with modified consensus states along with the query as input to the LLM. Note the negative response of the LLM here, despite it generating valid consensus states. This mismatch is present in $\textbf{46.8\%}$ of cases, reflecting poor LLM answering capability. For VQA Eval, we query various VLMs with Room and Object images (see Table \ref{tab:vqa-analysis}). Note the simpler Object VQA queries that are answered correctly, suggesting that breaking down situational queries into consensus ones can simplify the task via indirect answering.
...and 3 more figures
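
The evaluate step described in the Figure 2 caption (cosine-similarity filtering against the query database, then picking the query nearest each cluster centroid for feedback) can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function names, the default threshold of 0.9, the per-query rejection rule, and the toy 2-D vectors standing in for BERT embeddings are all assumptions. Cluster membership is taken as given, since the caption does not name the clustering method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_batch(batch, database, threshold=0.9):
    """Keep embeddings whose similarity to every stored query is below threshold.

    Assumption: each candidate is also compared against queries already
    accepted from the same batch, so near-duplicates within a batch are dropped.
    """
    accepted = []
    for emb in batch:
        if all(cosine(emb, other) < threshold for other in database + accepted):
            accepted.append(emb)
    return accepted


def nearest_to_centroid(cluster):
    """Index of the cluster member closest (Euclidean) to the cluster mean."""
    dim = len(cluster[0])
    centroid = [sum(v[i] for v in cluster) / len(cluster) for i in range(dim)]
    return min(range(len(cluster)), key=lambda i: math.dist(cluster[i], centroid))


# Toy 2-D vectors stand in for BERT query embeddings.
database = [[1.0, 0.0]]
batch = [[0.99, 0.1], [0.0, 1.0]]
kept = filter_batch(batch, database)          # first vector is a near-duplicate
rep = nearest_to_centroid([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [2.0, 2.0]])
print(kept, rep)
```

In the scheme as captioned, a failed similarity check leads to continuing the conversation with the LLM rather than silently dropping queries; the filter above only shows the comparison itself.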