OSCaR: Object State Captioning and State Change Representation

Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu

TL;DR

OSCaR introduces a novel dataset and benchmark for object state captioning and state-change reasoning in egocentric video. By combining diverse video sources, GPT-assisted data generation, and fine-tuned multimodal models, the work demonstrates the potential of natural language to express object states and their causal changes while revealing gaps in current MLLMs. The open-world and cooking-domain evaluations show promising generalization but also highlight significant room for improvement in accuracy and robustness. The study provides a scalable data-generation pipeline and a rigorous evaluation framework that can guide future research in visual reasoning and language grounding for dynamic object states.

Abstract

The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of the language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating multimodal large language models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR.

Paper Structure

This paper contains 23 sections, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Surpassing prior models in aligning with human judgements. Our method achieves near parity with GPT-4V ratings across helpfulness, accuracy, reasoning, and other key metrics.
  • Figure 2: OSCaR's description of state, state change, and illustration of reasoning. State description involves the characterization of a specific region of interest within the video and the associated activity. State change entails the description of the evolution of a system over a defined temporal sequence. Furthermore, the analysis of the state of an object is centered on comprehending and elucidating the mechanisms underlying the object's evolution.
  • Figure 3: Distribution of answer lengths. The histogram plots the number of answers (y-axis) against answer length in words (x-axis), separating short answers (1-9 words) from long answers ($\ge 10$ words). Answers of 100 words or more are grouped into a single bin at 100. This breakdown emphasizes the balance between brief, direct answers and more detailed, explanatory responses.
  • Figure 4: Top 10 open-world domains (excluding cooking). The figure shows non-cooking domains present in the open-world test set used to assess model generalization. By evaluating performance on household and occupational activities unseen during training, we benchmark the trained models' capacity to understand new objects and actions beyond cooking tasks.
  • Figure 5: GPT-4V zero-shot caption quality human evaluation. The figure shows the distribution of quality ratings assigned by human annotators evaluating frame descriptions automatically generated by the GPT-4V model under zero-shot conditions. Descriptions for 500 video frames were rated.
  • ...and 1 more figure
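
The binning described in the Figure 3 caption (short answers of 1-9 words, long answers of 10 or more, and a single overflow bin at 100) can be sketched as follows. This is only an illustrative sketch, not the authors' code: the whitespace tokenization, the function names, and the sample answers are all assumptions made for the example.

```python
from collections import Counter

OVERFLOW_BIN = 100  # answers with >= 100 words share one bin (per the caption)
SHORT_MAX = 9       # short answers: 1-9 words; long answers: >= 10 words


def word_count(answer: str) -> int:
    """Count whitespace-separated tokens (an assumed tokenization)."""
    return len(answer.split())


def length_histogram(answers):
    """Map each capped answer length to the number of answers of that length."""
    return Counter(min(word_count(a), OVERFLOW_BIN) for a in answers)


def short_long_split(answers):
    """Return (short, long) counts; empty answers fall in neither group."""
    lengths = [word_count(a) for a in answers]
    short = sum(1 for n in lengths if 1 <= n <= SHORT_MAX)
    long_ = sum(1 for n in lengths if n > SHORT_MAX)
    return short, long_


# Hypothetical answers, invented for illustration only.
sample = [
    "The onion is chopped.",
    "The dough was flat and dry, and after kneading it became smooth and elastic.",
    "word " * 120,
]
hist = length_histogram(sample)
short, long_ = short_long_split(sample)
```

Capping lengths with `min(..., OVERFLOW_BIN)` keeps the x-axis bounded while still counting every very long answer in the final bin.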