ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions

Shailaja Keyur Sampat, Yezhou Yang, Chitta Baral

TL;DR

This work proposes ActionCOMET, a zero-shot framework that discerns the knowledge in language models that is specific to a given visual input. It presents baseline results for ActionCOMET on a newly collected dataset and compares them with the performance of the best existing VQA approaches.

Abstract

Humans observe various actions being performed by other humans (physically or in videos/images) and can draw a wide range of inferences about them beyond what they can visually perceive. Such inferences include determining the aspects of the world that make action execution possible (e.g., liquid objects can undergo pouring), predicting how the world will change as a result of the action (e.g., potatoes being golden and crispy after frying), identifying high-level goals associated with the action (e.g., beat the eggs to make an omelet), and reasoning about actions that possibly precede or follow the current action (e.g., cracking eggs before whisking or draining pasta after boiling). A similar reasoning ability is highly desirable in autonomous systems that would assist us in performing everyday tasks. To that end, we propose a multi-modal task to learn the aforementioned concepts about actions being performed in images. We develop a dataset consisting of 8.5k images and 59.3k inferences about actions grounded in those images, collected from an annotated cooking-video dataset. We propose ActionCOMET, a zero-shot framework to discern knowledge present in language models specific to the provided visual input. We present baseline results of ActionCOMET over the collected dataset and compare them with the performance of the best existing VQA approaches.

Paper Structure

This paper contains 15 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Fine-grained action-centric commonsense generation addressed in this work. (Top-left) Input to the model: an image, the textual description of the image, and an action-object pair of interest. (Top-right) The 5 types of action-related inference that we are interested in predicting using a vision-language model: Effects, High-level Goals, Pre-conditions, Before Events, and After Events. (Bottom) Overview of the proposed ActionCOMET model, demonstrating how the different inputs are processed and how the inference is generated.
  • Figure 2: Sample video (https://www.youtube.com/watch?v=4eWzsx1vAi8) from the train+val partition of the YouCook2 dataset, with annotated procedure steps
  • Figure 3: Our data preparation pipeline, which leverages YouCook2 [ZhXuCoAAAI18] annotations to extract commonsense inferences about actions (best viewed in color)
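
The inputs and outputs described in the Figure 1 caption (an image, its textual description, an action-object pair, and five inference types) can be pictured as a simple record. The following is only an illustrative sketch: the class names, field names, and file path are hypothetical and are not taken from the paper or its released data format; the example inferences are borrowed from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class InferenceType(Enum):
    """The five inference types named in the Figure 1 caption."""
    EFFECT = "effect"
    GOAL = "high-level goal"
    PRECONDITION = "pre-condition"
    BEFORE_EVENT = "before event"
    AFTER_EVENT = "after event"


@dataclass
class ActionSample:
    """One hypothetical dataset item; the field names are assumptions."""
    image_path: str   # frame taken from a cooking video
    description: str  # textual description of the image
    action: str       # action of interest, e.g. "whisk"
    obj: str          # object the action is applied to, e.g. "eggs"
    inferences: Dict[InferenceType, List[str]] = field(default_factory=dict)

    def add(self, kind: InferenceType, text: str) -> None:
        """Attach one free-text inference of the given type."""
        self.inferences.setdefault(kind, []).append(text)


# Example built from the illustrations given in the abstract.
sample = ActionSample(
    image_path="frame_0001.jpg",  # hypothetical path
    description="a person whisks eggs in a bowl",
    action="whisk",
    obj="eggs",
)
sample.add(InferenceType.BEFORE_EVENT, "crack eggs")
sample.add(InferenceType.GOAL, "make an omelet")
```

Keeping the inference type as an enumeration makes it easy to group or evaluate predictions per type, which is one plausible way to organize such data.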