Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts?

Shailaja Keyur Sampat, Maitreya Patel, Yezhou Yang, Chitta Baral

TL;DR

This work presents a zero-shot framework for fine-grained visual concept learning that combines a large language model with a Visual Question Answering (VQA) system. It performs comparably to existing zero-shot visual classification methods and few-shot concept learning approaches, without substantial computational overhead.

Abstract

The ability to learn about new objects from a small amount of visual data, and to produce a convincing linguistic justification for the presence or absence of certain concepts (that collectively compose the object) in novel scenarios, is an important characteristic of human cognition. This is possible because humans abstract the attributes and properties that an object is composed of; for example, a 'bird' can be identified by the presence of a beak, feathers, legs, wings, etc. Inspired by this aspect of human reasoning, we present a zero-shot framework for fine-grained visual concept learning that leverages a large language model (LLM) and a Visual Question Answering (VQA) system. Specifically, we prompt GPT-3 to obtain a rich linguistic description of the visual objects in the dataset. We convert the obtained concept descriptions into a set of binary questions. We pose these questions, along with the query image, to a VQA system and aggregate the answers to determine the presence or absence of an object in the test images. Our experiments demonstrate performance comparable to existing zero-shot visual classification methods and few-shot concept learning approaches, without substantial computational overhead, while remaining fully explainable from a reasoning perspective.
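The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the question template, the fraction-of-"yes" aggregation rule with its `threshold` parameter, and the function names `descriptions_to_questions`, `aggregate_answers`, and `verify_object` are all assumptions introduced here. The LLM step is represented by a plain list of concept descriptions, and the VQA system by any callable that maps an image and a question to "yes" or "no".

```python
from typing import Any, Callable, Dict, List, Tuple


def descriptions_to_questions(descriptions: List[str]) -> List[str]:
    """Turn each concept description into a binary (yes/no) question.

    The template below is an assumed one, chosen only for illustration.
    """
    questions = []
    for description in descriptions:
        cleaned = description.strip().rstrip(".")
        if cleaned:
            questions.append(f"Does the object in the image have {cleaned}?")
    return questions


def aggregate_answers(answers: List[str], threshold: float = 0.5) -> bool:
    """Assumed aggregation rule: the object counts as present when the
    fraction of 'yes' answers reaches the threshold."""
    if not answers:
        return False
    yes_count = sum(answer.strip().lower() == "yes" for answer in answers)
    return yes_count / len(answers) >= threshold


def verify_object(
    image: Any,
    descriptions: List[str],
    vqa: Callable[[Any, str], str],
    threshold: float = 0.5,
) -> Tuple[bool, Dict[str, str]]:
    """Ask the VQA callable every question and aggregate the answers.

    Returns the decision together with the per-question answers, which
    serve as the explanation for that decision.
    """
    questions = descriptions_to_questions(descriptions)
    answers = [vqa(image, question) for question in questions]
    decision = aggregate_answers(answers, threshold)
    return decision, dict(zip(questions, answers))
```

In practice the `descriptions` list would come from the LLM prompt and `vqa` would wrap an actual VQA model; keeping both behind simple interfaces makes the per-question answers available as a human-readable justification of the final decision.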

Paper Structure

This paper contains 23 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The zero-shot VQA task considered in this paper, demonstrated using the 'cardinal' object category from CUB [wah2011caltech]: given an image and a question as input, the model has to predict a 'Yes'/'No' answer depending on the presence or absence of the object.
  • Figure 2: Overview of our proposed zero-shot method that uses LLM+VQA for concept learning. There are four key steps: (i) given an object category that needs to be verified in the image, we first query GPT-3 using a predefined prompt to obtain the concept descriptions for the object; (ii) the concept descriptions returned by GPT-3 are turned into a set of binary meta-questions; (iii) the test image, along with each meta-question, is posed to a VQA system; (iv) the answers to all meta-questions are aggregated to determine the presence or absence of the object category in the image.
  • Figure 3: Two categories from the CUB dataset and their fine-grained concept descriptions generated by GPT-3 for m={1,3,5}.
  • Figure 4: Two qualitative examples predicted by the GPT-3+BLIP model, which is a top-performing zero-shot variant in our experiments.
  • Figure 5: Plots demonstrating how accuracy (top), false positives (middle), and false negatives (bottom) per object category change when more concept descriptions obtained using GPT-3 (i.e., when changing from m=1 to m=3) are incorporated in our best zero-shot model, GPT-3+BLIP.
  • ...and 2 more figures
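
The per-category quantities plotted in Figure 5 (accuracy, false positives, false negatives) could be tallied along the following lines. This is only a sketch under assumptions: the record layout `(category, predicted, actual)` and the function name `per_category_stats` are hypothetical, and the paper's own evaluation code may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_category_stats(
    records: Iterable[Tuple[str, bool, bool]]
) -> Dict[str, Dict[str, float]]:
    """Tally accuracy, false positives, and false negatives per category.

    Each record is (category, predicted_present, actually_present).
    """
    counts = defaultdict(lambda: {"correct": 0, "fp": 0, "fn": 0, "total": 0})
    for category, predicted, actual in records:
        entry = counts[category]
        entry["total"] += 1
        if predicted == actual:
            entry["correct"] += 1
        elif predicted:
            # Predicted 'Yes' although the object is absent.
            entry["fp"] += 1
        else:
            # Predicted 'No' although the object is present.
            entry["fn"] += 1
    return {
        category: {
            "accuracy": entry["correct"] / entry["total"],
            "false_positives": entry["fp"],
            "false_negatives": entry["fn"],
        }
        for category, entry in counts.items()
    }
```

Running this once on predictions made with m=1 and once with m=3 would give the two sets of per-category numbers that such a comparison needs.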