ObjChangeVR: Object State Change Reasoning from Continuous Egocentric Views in VR Environments

Shiyi Ding, Shaoen Wu, Ying Chen

TL;DR

This work introduces ObjChangeVR-Dataset, a benchmark for question answering about object state changes in VR, and proposes ObjChangeVR, a framework that combines viewpoint-aware and temporal-based retrieval to identify relevant frames with cross-view reasoning that reconciles inconsistent evidence from multiple viewpoints. ObjChangeVR significantly outperforms baseline approaches across multiple MLLMs.

Abstract

Recent advances in multimodal large language models (MLLMs) offer a promising approach for natural language-based scene change queries in virtual reality (VR). Prior work on applying MLLMs for object state understanding has focused on egocentric videos that capture the camera wearer's interactions with objects. However, object state changes may occur in the background without direct user interaction, lacking explicit motion cues and making them difficult to detect. Moreover, no benchmark exists for evaluating this challenging scenario. To address these challenges, we introduce ObjChangeVR-Dataset, specifically for benchmarking the question-answering task of object state change. We also propose ObjChangeVR, a framework that combines viewpoint-aware and temporal-based retrieval to identify relevant frames, along with cross-view reasoning that reconciles inconsistent evidence from multiple viewpoints. Extensive experiments demonstrate that ObjChangeVR significantly outperforms baseline approaches across multiple MLLMs.

Paper Structure

This paper contains 22 sections, 1 equation, 3 figures, and 16 tables.

Figures (3)

  • Figure 1: Illustration of the question-answering task for object state change reasoning. Given a query frame and a question about object change, we retrieve several relevant frames from the egocentric frame sequence and leverage visual evidence from the retrieved frames to produce an answer and an explanation.
  • Figure 2: Overview of the ObjChangeVR-Dataset and the proposed ObjChangeVR framework.
  • Figure 3: Proportion of questions (out of 5,000) with consistent and inconsistent intermediate answers across different $k$.
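The split reported in Figure 3 can be illustrated with a minimal sketch. It assumes, hypothetically, that each of the $k$ retrieved frames yields one intermediate answer per question and that a question counts as "consistent" when all of its intermediate answers agree after simple normalization. The function name, the input layout, and the normalization step are illustrative assumptions, not the authors' implementation.

```python
def consistency_proportions(answers_per_question):
    """Return the share of questions whose intermediate answers all agree.

    answers_per_question: list of lists of strings, one inner list per
    question, holding the intermediate answers from the k retrieved frames
    (hypothetical layout, assumed for this sketch).
    """
    total = len(answers_per_question)
    if total == 0:
        return {"consistent": 0.0, "inconsistent": 0.0}

    consistent = 0
    for answers in answers_per_question:
        # Normalize case and whitespace so "Yes " and "yes" count as agreeing.
        distinct = {a.strip().lower() for a in answers}
        if len(distinct) <= 1:
            consistent += 1

    return {
        "consistent": consistent / total,
        "inconsistent": (total - consistent) / total,
    }


# Toy input with k = 3 intermediate answers per question (made-up values).
example = [
    ["yes", "Yes ", "yes"],   # all agree after normalization
    ["yes", "no", "yes"],     # views disagree
    ["no", "no", "no"],       # all agree
    ["no", "yes", "yes"],     # views disagree
]
proportions = consistency_proportions(example)
```

Running the same tally for several values of $k$ would give one pair of proportions per $k$, which is the kind of breakdown the Figure 3 caption describes.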