Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts?

Shailaja Keyur Sampat; Maitreya Patel; Yezhou Yang; Chitta Baral

TL;DR

This work presents a zero-shot framework for fine-grained visual concept learning that leverages a large language model and a Visual Question Answering (VQA) system. It demonstrates performance comparable to existing zero-shot visual classification methods and few-shot concept learning approaches, without substantial computational overhead.

Abstract

The ability to learn about new objects from a small amount of visual data and to produce convincing linguistic justifications for the presence or absence of certain concepts (that collectively compose the object) in novel scenarios is an important characteristic of human cognition. This is possible due to the abstraction of the attributes/properties that an object is composed of; e.g., a 'bird' can be identified by the presence of a beak, feathers, legs, wings, etc. Inspired by this aspect of human reasoning, we present a zero-shot framework for fine-grained visual concept learning that leverages a large language model and a Visual Question Answering (VQA) system. Specifically, we prompt GPT-3 to obtain a rich linguistic description of the visual objects in the dataset. We convert the obtained concept descriptions into a set of binary questions. We pose these questions, along with the query image, to a VQA system and aggregate the answers to determine the presence or absence of an object in the test images. Our experiments demonstrate performance comparable to existing zero-shot visual classification methods and few-shot concept learning approaches, without substantial computational overhead, while remaining fully explainable from a reasoning perspective.
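
The pipeline described in the abstract (concept descriptions, binary questions, VQA answers, aggregation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the attribute lists are hand-written stand-ins for GPT-3 output, the question template is invented, `toy_vqa` is a stub in place of a real VQA model, and the aggregation rule (fraction of "yes" answers, highest score wins) is just one plausible choice.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical attribute lists; in the described framework these would be
# derived from GPT-3 descriptions of each object, not written by hand.
CONCEPT_ATTRIBUTES: Dict[str, List[str]] = {
    "bird": ["a beak", "feathers", "wings"],
    "cat": ["whiskers", "fur", "a tail"],
}


def to_binary_questions(attributes: List[str]) -> List[str]:
    """Turn each attribute into a yes/no question (assumed template)."""
    return [f"Does the object have {attr}?" for attr in attributes]


def score_concept(image, attributes: List[str],
                  vqa: Callable[[object, str], str]) -> float:
    """Fraction of attribute questions the VQA system answers 'yes' to."""
    questions = to_binary_questions(attributes)
    if not questions:
        return 0.0
    yes_count = sum(
        1 for q in questions if vqa(image, q).strip().lower() == "yes"
    )
    return yes_count / len(questions)


def classify(image, concepts: Dict[str, List[str]],
             vqa: Callable[[object, str], str]) -> Tuple[str, Dict[str, float]]:
    """Score every concept and return the best one with all scores."""
    scores = {name: score_concept(image, attrs, vqa)
              for name, attrs in concepts.items()}
    best = max(scores, key=scores.get)
    return best, scores


def toy_vqa(image: Set[str], question: str) -> str:
    """Stub VQA: the 'image' is just a set of visible attribute strings."""
    return "yes" if any(attr in question for attr in image) else "no"


# Usage: a toy 'image' that shows a beak, feathers, and wings.
toy_image = {"a beak", "feathers", "wings"}
label, scores = classify(toy_image, CONCEPT_ATTRIBUTES, toy_vqa)
```

Because each score is built from individual yes/no answers, the per-question answers can be kept alongside the prediction as a justification, which reflects the explainability the abstract emphasizes.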
