ObjChangeVR: Object State Change Reasoning from Continuous Egocentric Views in VR Environments

Shiyi Ding, Shaoen Wu, Ying Chen

TL;DR

This work introduces ObjChangeVR-Dataset, a benchmark for question answering about object state changes in VR. It also proposes ObjChangeVR, a framework that combines viewpoint-aware and temporal-based retrieval to identify relevant frames with cross-view reasoning that reconciles inconsistent evidence from multiple viewpoints. ObjChangeVR significantly outperforms baseline approaches across multiple MLLMs.

Abstract

Recent advances in multimodal large language models (MLLMs) offer a promising approach to answering natural-language queries about scene changes in virtual reality (VR). Prior work applying MLLMs to object state understanding has focused on egocentric videos that capture the camera wearer's interactions with objects. However, object state changes may also occur in the background without direct user interaction; such changes lack explicit motion cues, which makes them difficult to detect. Moreover, no benchmark exists for evaluating this challenging scenario. To address these challenges, we introduce ObjChangeVR-Dataset, a benchmark designed specifically for question answering about object state changes. We also propose ObjChangeVR, a framework that combines viewpoint-aware and temporal-based retrieval to identify relevant frames with cross-view reasoning that reconciles inconsistent evidence from multiple viewpoints. Extensive experiments demonstrate that ObjChangeVR significantly outperforms baseline approaches across multiple MLLMs.
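The abstract does not say how the viewpoint-aware and temporal cues are combined during retrieval. The following is a minimal sketch under explicit assumptions, not the authors' method. Each frame carries a timestamp and a camera viewing direction (hypothetical fields `t` and `forward`). Only frames recorded before the query are candidates. Viewpoint relevance is the cosine similarity of viewing directions, and temporal relevance decays exponentially with the time gap. The two are blended with hypothetical weights `alpha` and `tau`, which do not come from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Frame:
    idx: int                             # position in the egocentric sequence
    t: float                             # timestamp in seconds (hypothetical field)
    forward: Tuple[float, float, float]  # camera viewing direction (hypothetical field)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two direction vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve(query: Frame, history: List[Frame], k: int = 3,
             alpha: float = 0.7, tau: float = 30.0) -> List[int]:
    """Return indices of the top-k earlier frames by a blended relevance score.

    view score:    cosine similarity of viewing directions, rescaled to [0, 1]
    recency score: exp(-gap / tau), favoring frames close in time to the query
    alpha, tau:    hypothetical hyperparameters chosen for illustration
    """
    scored = []
    for f in history:
        if f.t >= query.t:  # assumption: only earlier frames can show the prior state
            continue
        view = (cosine(query.forward, f.forward) + 1.0) / 2.0
        recency = math.exp(-(query.t - f.t) / tau)
        scored.append((alpha * view + (1.0 - alpha) * recency, f.idx))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [idx for _, idx in scored[:k]]


if __name__ == "__main__":
    history = [
        Frame(0, 0.0, (1.0, 0.0, 0.0)),   # same direction, but long ago
        Frame(1, 10.0, (0.0, 1.0, 0.0)),  # perpendicular view
        Frame(2, 20.0, (1.0, 0.1, 0.0)),  # nearly the same direction, more recent
        Frame(3, 50.0, (-1.0, 0.0, 0.0)), # recent, but facing away
    ]
    query = Frame(4, 60.0, (1.0, 0.0, 0.0))
    print(retrieve(query, history, k=2))  # [2, 0]
```

With these settings, a recent frame that faces away from the query's viewpoint ranks below older frames that look in the same direction. This ranking illustrates why a purely temporal window might miss the view that actually shows the object.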

Paper Structure

This paper contains 22 sections, 1 equation, 3 figures, and 16 tables.

Introduction
Related Work
ObjChangeVR-Dataset
Methodology
Relevant Cross-view Frame Retrieval
Temporal Cross-view Reasoning
Experimental Setup
Baselines
Evaluation Metrics
Hyperparameters and Default Settings
Experimental Results
Conclusion
Appendix
Details of ObjChangeVR-Dataset
Comparison of ObjChangeVR-Dataset and Existing 3D and Video-based Benchmarks
...and 7 more sections

Figures (3)

Figure 1: Illustration of the question-answering task for object state change reasoning. Given a query frame and a question about object change, we retrieve several relevant frames from the egocentric frame sequence and leverage visual evidence from the retrieved frames to produce an answer and an explanation.
Figure 2: Overview of the ObjChangeVR-Dataset and the proposed ObjChangeVR framework.
Figure 3: Proportion of questions (out of 5,000) with consistent and inconsistent intermediate answers across different $k$.
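Figure 3 separates questions whose intermediate answers agree from those whose answers disagree as the number of retrieved frames $k$ varies. The sketch below assumes one intermediate answer per retrieved frame, each paired with a weight such as its retrieval score. It shows the consistency check and one way to resolve a conflict. The weighted vote is a simple stand-in chosen for illustration; the paper reconciles inconsistent evidence with MLLM-based cross-view reasoning, whose details are not given here. `reconcile` and `consistency_rate` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def is_consistent(answers: List[str]) -> bool:
    """True when every intermediate answer agrees (the 'consistent' case)."""
    return len(set(answers)) <= 1


def reconcile(evidence: List[Tuple[str, float]]) -> str:
    """Choose a final answer from (answer, weight) pairs, one per retrieved frame.

    Consistent evidence is returned directly. Inconsistent evidence is resolved
    by a weighted vote (a stand-in, not the paper's cross-view reasoning); ties
    go to the answer backed by the single highest-weight frame.
    """
    if not evidence:
        raise ValueError("need at least one intermediate answer")
    answers = [a for a, _ in evidence]
    if is_consistent(answers):
        return answers[0]
    totals: Dict[str, float] = defaultdict(float)
    for answer, weight in evidence:
        totals[answer] += weight
    top = max(totals.values())
    tied = [a for a, w in totals.items() if w == top]
    return max(tied, key=lambda a: max(w for ans, w in evidence if ans == a))


def consistency_rate(per_question: List[List[str]]) -> float:
    """Fraction of questions whose k intermediate answers all agree."""
    return sum(is_consistent(a) for a in per_question) / len(per_question)


if __name__ == "__main__":
    evidence = [("yes", 0.9), ("no", 0.4), ("no", 0.3)]
    print(reconcile(evidence))  # yes
    print(consistency_rate([["yes", "yes"], ["yes", "no"]]))  # 0.5
```

Counting `consistency_rate` over all questions for several values of $k$ would give the kind of consistent/inconsistent proportions plotted in Figure 3. Larger $k$ brings in more viewpoints and therefore more chances for disagreement.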
